# Deep Review: 20260321-203726-pr-242

| | |
|---|---|
| **Date** | 2026-03-21 20:37 |
| **Author** | [@jandubois](https://github.com/jandubois) |
| **PR** | [#242](https://github.com/rancher-sandbox/rancher-desktop-daemon/pull/242) — Run Lima BATS tests on Windows via MSYS2 |
| **Branch** | `msys2-ci` |
| **Commits** | `da32824` Fix Windows process signaling for Lima hostagent lifecycle<br>`3e5f589` Fix stopInstanceForcibly to terminate WSL2 distro and clean up files<br>`5f72824` Enable Lima BATS tests on Windows<br>`127a0a1` Pre-generate Lima SSH keys on Windows to avoid broken path conversion<br>`88a586b` Run Windows BATS tests via MSYS2 instead of WSL2<br>`a09dff8` Honor RDD_LOG_LEVEL in external controllers |
| **Reviewers** | Claude Opus 4.6, Codex GPT 5.4, Gemini 3.1 Pro |
| **Verdict** | **Merge with fixes** — two low-risk regressions worth addressing before merge |
| **Wall-clock time** | 62 min 30 s |

---

## Consolidated Review

### Executive Summary

This PR rewires Windows hostagent lifecycle management: each hostagent runs in its own process group so stopping a VM no longer kills the RDD service, `stopInstanceForcibly` terminates the WSL2 distro and cleans up stale files before unregistering, SSH keys are pre-generated to avoid MSYS2 path conversion issues, and CI switches from WSL2-based BATS to MSYS2-based BATS. The implementation is thorough, with careful attention to crash-safety, PID recycling, and WSL2 deadlock prevention. Two regressions merit fixes: `waitAfterKill` inherits a potentially exhausted parent context, and `KillTree` errors on already-dead processes prevent PID file cleanup.

### Critical Issues

None.

### Important Issues

1. **`waitAfterKill` derives context from potentially exhausted parent** — `limavm_lifecycle.go:508-512` (important, regression) [Claude Opus 4.6]

```go
waitAfterKill := func() {
    killCtx, killCancel := context.WithTimeout(ctx, 10*time.Second)
    defer killCancel()
    r.waitForProcessExit(killCtx, name)
}
```

`waitAfterKill` derives `killCtx` from the parent `ctx`. If the parent context was cancelled externally (manager shutdown racing with reconciler), `killCtx` is immediately done and `waitForProcessExit` returns without waiting. The `forceStop` closure on line 489 already uses `context.Background()` for the same reason — `waitAfterKill` should do the same.

The consequence is minor: the process was already force-killed, and `stopWatcher` at the caller site (line 449) implicitly waits for exit via `cmd.Wait()`. But the bounded wait becomes a no-op instead of providing the guaranteed window the code intends.

Fix:

```diff
 waitAfterKill := func() {
-    killCtx, killCancel := context.WithTimeout(ctx, 10*time.Second)
+    killCtx, killCancel := context.WithTimeout(context.Background(), 10*time.Second)
     defer killCancel()
     r.waitForProcessExit(killCtx, name)
 }
```

2. **`KillTree` errors on already-dead processes, blocking PID file cleanup** — `limavm_lifecycle.go:590-598` (important, regression) [Gemini 3.1 Pro, verified]

```go
allKilled := true
for _, pid := range []int{inst.DriverPID, inst.HostAgentPID} {
    if pid > 0 {
        if err := process.KillTree(ctx, pid); err != nil {
            logger.V(1).Info("Failed to kill process tree", "pid", pid, "error", err)
            allKilled = false
        }
    }
}
```

`KillTree` returns an error when the target process has already exited (`ESRCH` on Unix, non-zero exit from `taskkill` on Windows). The `allKilled` guard at line 625 then skips PID file cleanup, leaving stale `.pid` and `.sock` files behind.

Gemini flagged this as CRITICAL, claiming the instance would be permanently stuck at `StatusBroken`. That claim is overstated: on the `stopInstance` path, the post-shutdown `store.Inspect` (line 452) derives status independently, and `handleUnwatchedState` (post-restart) calls `ReadPIDFile`, which returns `(0, err)` for dead Windows processes, so `stopInstanceForcibly` runs with PID=0 and cleanup proceeds. The system self-heals within one restart cycle.

The fix is still worthwhile: making `KillTree` idempotent is the correct semantic, and it avoids the needless stale-file accumulation.

Fix — make `KillTree` return nil when the target is already dead:

On Unix (`process_unix.go:37`):

```diff
 func KillTree(_ context.Context, pid int) error {
-    return unix.Kill(-pid, unix.SIGKILL)
+    err := unix.Kill(-pid, unix.SIGKILL)
+    if errors.Is(err, unix.ESRCH) {
+        return nil
+    }
+    return err
 }
```

On Windows (`process_windows.go:61`):

```diff
 func KillTree(ctx context.Context, pid int) error {
-    return exec.CommandContext(ctx, "taskkill", "/F", "/T", "/PID", strconv.Itoa(pid)).Run()
+    err := exec.CommandContext(ctx, "taskkill", "/F", "/T", "/PID", strconv.Itoa(pid)).Run()
+    if err != nil {
+        var exitErr *exec.ExitError
+        if errors.As(err, &exitErr) && exitErr.ExitCode() == 128 {
+            return nil
+        }
+        return err
+    }
+    return nil
 }
```

### Suggestions

1. **`Kill` on Windows does not detect `WaitForSingleObject` timeout** — `process_windows.go:51-54` (suggestion, gap) [Claude Opus 4.6]

```go
_, err = windows.WaitForSingleObject(hProcess, uint32(10*time.Second/time.Millisecond))
if err != nil {
    return fmt.Errorf("timed out waiting for process %d to terminate: %w", pid, err)
}
```

`WaitForSingleObject` returns `(WAIT_TIMEOUT, nil)` when the timeout expires — the first return value is discarded. `Kill` returns nil (success) even though the process has not exited. This is pre-existing code (Mark Yen, commit `1b9d99af`), and all current callers mitigate it by subsequently polling for exit.

Fix (if desired):

```diff
-_, err = windows.WaitForSingleObject(hProcess, uint32(10*time.Second/time.Millisecond))
-if err != nil {
-    return fmt.Errorf("timed out waiting for process %d to terminate: %w", pid, err)
-}
+result, err := windows.WaitForSingleObject(hProcess, uint32(10*time.Second/time.Millisecond))
+if err != nil {
+    return fmt.Errorf("failed waiting for process %d to terminate: %w", pid, err)
+}
+if result == uint32(windows.WAIT_TIMEOUT) {
+    return fmt.Errorf("timed out waiting for process %d to terminate", pid)
+}
```

2. **`forceStop` closure mutates captured `inst` parameter** — `limavm_lifecycle.go:491-493` (suggestion, regression) [Claude Opus 4.6]

```go
forceStop := func() {
    ...
    if inst == nil {
        var err error
        inst, err = store.Inspect(forceCtx, name)
```

The `forceStop` closure captures and reassigns the `inst` parameter from `shutdownHostagent`. No code reads `inst` after `forceStop()` runs today, so this is correct but fragile — future changes that access `inst` after line 519 would read the mutated value. A local variable inside the closure would be clearer.

---

### Design Observations

**Strengths**

- **Process group isolation is well-layered.** `SetGroup` / `Interrupt` / `Kill` / `KillTree` form a clean escalation chain: SIGINT/CTRL_BREAK, SIGTERM/TerminateProcess, SIGKILL/taskkill. Each layer has a clear purpose, and the platform-specific behavior is encapsulated in one file per OS. [Claude Opus 4.6]

- **WSL2 deadlock prevention is principled.** The "always terminate before unregister" approach at `handleDeletion` lines 47-52 and the `wslservice.exe` kill recovery in the BATS cleanup test address a known Windows kernel-level issue without introducing new failure modes. [Claude Opus 4.6, Codex GPT 5.4]

- **PID zeroing before Lima's Delete prevents PID recycling kills.** Setting `DriverPID = 0` and `HostAgentPID = 0` at lines 55-56 before calling `limainstance.Delete` stops Lima's internal `StopForcibly` from signaling unrelated processes if the OS recycled those PIDs. [Gemini 3.1 Pro, Codex GPT 5.4]

- **SSH key pre-generation is crash-safe.** The two-stage rename in `ssh_keys_windows.go` (lines 65-74) handles every crash-recovery scenario: orphaned public key without private key, stale temp files, corrupt private key. Each path converges on a clean state. [Claude Opus 4.6, Gemini 3.1 Pro]

- **`os.ReadDir` instead of `filepath.Glob` for cleanup.** Using `os.ReadDir` at line 631 avoids silent failures on paths containing bracket characters (e.g., `C:\Users\name[1]`). [Gemini 3.1 Pro]

**Concerns**

- **`exec.CommandContext` races with graceful shutdown** (future) — The hostagent starts with `exec.CommandContext(ctx, ...)` at line 397. When the manager context is cancelled during shutdown, Go's `CommandContext` sends `os.Kill` (SIGKILL on Unix, TerminateProcess on Windows) immediately, before `shutdownAllHostagents` can send a graceful signal. Hostagents are hard-killed on shutdown instead of following the graceful CTRL_BREAK, WSL terminate, cleanup path. Setting `cmd.Cancel` to send SIGINT/CTRL_BREAK and adding `cmd.WaitDelay` would let hostagents run their own shutdown before `shutdownAllHostagents` escalates. This is pre-existing (line 397 predates this PR) but the PR makes it more impactful by adding Windows-specific cleanup that only runs on the graceful path. [Claude Opus 4.6] `(future)`

---

### Testing Assessment

Test coverage is strong. The BATS tests cover start, stop, crash recovery, orphaned hostagent recovery, graceful shutdown, restart, template change detection, and deletion — all against the actual VM runtime. Windows-specific recovery (the cleanup test at lines 521-526) wraps `rdd ctl wait --for=delete` with an outer timeout and kills `wslservice.exe` if the wait hangs.

Untested scenarios by risk:

1. **SSH key bootstrap with partial on-disk state** — No focused test exercises partial states like `user` present but `user.pub` missing, or stale `user.tmp*` files from a prior crash. Only covered indirectly through integration startup. [Codex GPT 5.4]
2. **Forced-stop fallback path** — No test exercises the path where `process.Interrupt()` succeeds but the hostagent never reaches `Stopped`, causing `shutdownHostagent()` to fall through to `stopInstanceForcibly()`. [Codex GPT 5.4]
3. **Graceful vs. forced termination distinction** — The orphaned hostagent test kills the service with `process_kill` (taskkill /F) and verifies recovery, but does not distinguish whether the hostagent was stopped gracefully (via CTRL_BREAK) or forcibly. [Claude Opus 4.6]
4. **Concurrent template change during shutdown** — No test modifies the template ConfigMap while the service shuts down. [Claude Opus 4.6]
5. **Outer-timeout wslservice recovery path** — The BATS recovery path that kills `wslservice.exe` after a timeout (line 513-524) is only exercised if WSL2 actually deadlocks during the test run. [Codex GPT 5.4]

---

### Documentation Assessment

No gaps. The in-code comments are thorough: the WSL2 deadlock rationale (lines 47-51), PID zeroing rationale (lines 53-54), and `stopInstanceForcibly` design notes (lines 579-589) provide sufficient context for future maintainers.

---

### Commit Structure

The six commits form a clean logical progression: process signaling fix, WSL2 lifecycle fix, Lima test enablement, SSH key workaround, CI migration, logging propagation. Each commit is self-contained, and each message accurately describes its changes. No fixup commits.

---

### Acknowledged Limitations

1. **Sequential hostagent shutdown** — `limavm_controller.go:502-503`:
   > `// TODO: Wait on all hostagents in parallel instead of sequentially; with the current loop, the total wait is N × gracefulShutdownTimeout in the worst case.`

   This PR does not change the sequential wait. With the new Windows support adding WSL2 cleanup per instance, the sequential cost increases slightly. The worst case (N hanging hostagents times 30 s) is more likely on Windows where WSL subsystem degradation can prevent clean exits. [Claude Opus 4.6, Codex GPT 5.4, Gemini 3.1 Pro]

2. **Non-blocking stop** — `limavm_lifecycle.go:442-443`:
   > `// TODO: Non-blocking stop: send SIGINT and return immediately; the watcher detects the Exiting event and triggers a reconcile.`

   Pre-existing. The PR adds `shutdownHostagent`, which encapsulates the blocking stop logic and makes a future non-blocking refactor easier. [Claude Opus 4.6, Codex GPT 5.4]

---

## Agent Performance Retro

### [Claude]

Claude produced the most thorough and nuanced review. It identified the `waitAfterKill` context issue (the only important finding unique to one agent), the `WaitForSingleObject` gap, the `forceStop` closure mutation concern, and the `exec.CommandContext` shutdown race. It traced call chains across multiple files and verified findings with `git blame`. All findings included accurate line references and concrete fix suggestions.

Claude did not catch the `KillTree` idempotency gap that Gemini found.

### [Codex]

Codex completed the review in under 5 minutes — three to four times faster than the other agents — but returned zero findings. Its testing assessment identified useful coverage gaps, and its design observations correctly noted the split between graceful and forced shutdown. However, it missed all actionable code issues that the other agents found.

### [Gemini]

Gemini identified the `KillTree` idempotency gap, the most impactful regression in this PR. It traced the `allKilled` guard logic, identified that `KillTree` errors on dead processes, and proposed concrete platform-specific fixes. However, it significantly overstated the severity: it claimed the instance would be "permanently blocked," but tracing through the callers shows the system self-heals within one restart cycle. The claim that `os.FindProcess` "always succeeds" on Windows is also inaccurate — it calls `OpenProcess`, which fails for dead processes once all handles are closed.

### Summary

| Metric | Claude Opus 4.6 | Codex GPT 5.4 | Gemini 3.1 Pro |
|---|---|---|---|
| Duration | 11:05 | 4:57 | 16:24 |
| Critical | 0 | 0 | 1 (downgraded to Important) |
| Important | 1 | 0 | 0 |
| Suggestion | 2 | 0 | 0 |
| Design observations | 5 strengths, 1 concern | 2 strengths | 3 strengths |
| False positives | 0 | 0 | 0 (but 1 severity inflation) |
| Unique insights | 3 (waitAfterKill, WaitForSingleObject, CommandContext race) | 0 | 1 (KillTree idempotency) |

Claude provided the most value overall: it caught the most findings, made no false claims, and produced detailed design observations. Gemini's unique contribution (KillTree idempotency) was the single most impactful regression finding, but its severity analysis was unreliable. Codex was the fastest but produced no actionable findings — useful as a confidence signal when it agrees with the other agents, but insufficient as a sole reviewer.

---

## Improvement Recommendations

**Repo context updates**

- [repo] On Windows, `os.FindProcess` calls `OpenProcess` and fails for dead processes once all handles are closed. Lima's `ReadPIDFile` does not auto-clean stale PID files on Windows (unlike Unix, where it removes the file and returns 0). Review agents should check for this asymmetry when reviewing process lifecycle code that depends on PID file presence for status derivation.

---

## Appendix: Original Reviews

### Claude Opus 4.6

# PR #242: Run Lima BATS tests on Windows via MSYS2

## Executive Summary

This PR fixes critical Windows-specific process signaling so that stopping or restarting a VM no longer kills the RDD service itself, adds WSL2 deadlock prevention by always terminating distros before unregistering, switches CI from WSL2-based BATS to MSYS2-based BATS, and propagates `RDD_LOG_LEVEL` to external controllers. The implementation is thorough and well-documented, with careful attention to crash-safety and platform-specific edge cases.

**Verdict: Approve** — no blocking issues found.

---

## Findings

### Critical Issues

None.

### Important Issues

1. **`waitAfterKill` inherits potentially-exhausted parent context** — `limavm_lifecycle.go:508-512` (important, regression)

```go
waitAfterKill := func() {
    killCtx, killCancel := context.WithTimeout(ctx, 10*time.Second)
    defer killCancel()
    r.waitForProcessExit(killCtx, name)
}
```

`waitAfterKill` derives `killCtx` from the parent `ctx`. If the graceful shutdown path consumed most of the parent context's deadline (e.g., the 30-second wait at line 515-517), `waitAfterKill`'s 10-second timeout may be effective. But if the parent context was cancelled externally (manager shutdown races with reconciler), `killCtx` is immediately done and `waitForProcessExit` returns without waiting.

The consequence is minor — the process was already force-killed, and the watcher's `cmd.Wait()` (called by `stopWatcher` at the caller's site) implicitly waits for the process to exit. But the `waitAfterKill` becomes a no-op rather than the bounded wait the code intends.

Fix: Use `context.Background()` (matching `forceStop`'s approach on the line above), since the kill already happened and the wait should always be bounded by its own timeout, not the parent's remaining budget.

```diff
 waitAfterKill := func() {
-    killCtx, killCancel := context.WithTimeout(ctx, 10*time.Second)
+    killCtx, killCancel := context.WithTimeout(context.Background(), 10*time.Second)
     defer killCancel()
     r.waitForProcessExit(killCtx, name)
 }
```

### Suggestions

1. **`Kill` on Windows does not detect `WaitForSingleObject` timeout** — `process_windows.go:51-54` (suggestion, gap)

```go
_, err = windows.WaitForSingleObject(hProcess, uint32(10*time.Second/time.Millisecond))
if err != nil {
    return fmt.Errorf("timed out waiting for process %d to terminate: %w", pid, err)
}
```

`WaitForSingleObject` returns `(WAIT_TIMEOUT, nil)` when the timeout expires — the first return value is discarded. If `TerminateProcess` at line 48 didn't complete within 10 seconds, `Kill` returns nil (success) even though the process hasn't fully exited. This is pre-existing code (Mark Yen, commit 1b9d99af), and all callers mitigate it by subsequently polling for exit. Noting for awareness.

Fix (if desired):
```diff
-_, err = windows.WaitForSingleObject(hProcess, uint32(10*time.Second/time.Millisecond))
-if err != nil {
-    return fmt.Errorf("timed out waiting for process %d to terminate: %w", pid, err)
-}
+result, err := windows.WaitForSingleObject(hProcess, uint32(10*time.Second/time.Millisecond))
+if err != nil {
+    return fmt.Errorf("failed waiting for process %d to terminate: %w", pid, err)
+}
+if result == uint32(windows.WAIT_TIMEOUT) {
+    return fmt.Errorf("timed out waiting for process %d to terminate", pid)
+}
```

2. **`forceStop` closure mutates captured `inst` variable** — `limavm_lifecycle.go:491-493` (suggestion, regression)

```go
forceStop := func() {
    ...
    if inst == nil {
        var err error
        inst, err = store.Inspect(forceCtx, name)
```

The `forceStop` closure captures the `inst` parameter by reference and mutates it (assigns a new value from `store.Inspect`). While correct today (no code reads `inst` after `forceStop` runs), the implicit mutation through a closure is fragile — future edits could introduce a subtle bug by accessing `inst` below the `forceStop()` call. A local variable inside the closure would be clearer.

---

## Design Observations

### Strengths

- **Process group isolation is well-layered.** `SetGroup` / `Interrupt` / `Kill` / `KillTree` form a clean escalation chain: SIGINT/CTRL_BREAK → SIGTERM/TerminateProcess → SIGKILL/taskkill. Each layer has a clear purpose and the platform-specific behavior is encapsulated in one file per OS.

- **WSL2 deadlock prevention is principled.** The "always terminate before unregister" approach at `handleDeletion` lines 47-52 and the wslservice.exe kill recovery in the BATS cleanup test address a known Windows kernel-level issue (documented in memory) without introducing new failure modes. The 60-second `deleteCtx` timeout prevents the caller from blocking indefinitely.

- **SSH key pre-generation is crash-safe.** The two-stage rename in `ssh_keys_windows.go` (lines 65-74) handles every crash-recovery scenario correctly: orphaned public key without private key, stale temp files from prior crash, corrupt private key. Each path converges on a clean state.

- **PID zeroing before Lima's Delete is a smart defense.** At `handleDeletion` lines 55-56, zeroing `DriverPID` and `HostAgentPID` after RDD's own force-stop prevents Lima's internal `StopForcibly` from sending signals to recycled PIDs — a real risk on Windows where PIDs cycle fast.

- **`allKilled` guard in `stopInstanceForcibly`** (lines 625-627) correctly trades completeness for safety: if process kill fails, PID files are preserved so `store.Inspect` doesn't falsely report StatusStopped for a still-running process.

### Concerns

- **`exec.CommandContext` races with graceful shutdown** (future) — The hostagent is started with `exec.CommandContext(ctx, ...)` at line 397 of `limavm_lifecycle.go`. When the manager context is cancelled during shutdown, Go's `CommandContext` sends `os.Kill` (SIGKILL on Unix, TerminateProcess on Windows) immediately, before `shutdownAllHostagents` has a chance to send a graceful signal. This means hostagents are hard-killed on shutdown rather than getting the graceful CTRL_BREAK → WSL terminate → cleanup path. Setting `cmd.Cancel` to send SIGINT/CTRL_BREAK and adding a `cmd.WaitDelay` would let the hostagent run its own shutdown before `shutdownAllHostagents` also signals it. This is pre-existing (line 397 predates this PR) but the PR makes it more impactful by adding Windows-specific cleanup (WSL terminate, PID file removal) that only runs on the graceful path.

---

## Testing Assessment

Test coverage is solid. The BATS tests cover:
- Start, stop, crash recovery, orphaned hostagent recovery, graceful shutdown, restart, template change detection, and deletion — all on the actual VM runtime.
- Windows-specific recovery: the cleanup test (line 521-526) wraps `rdd ctl wait --for=delete` with an outer timeout and kills `wslservice.exe` if the wait hangs.
- Cross-platform process management helpers (`assert_process_alive`, `process_kill`) use native tools (`tasklist.exe`/`taskkill.exe`) on Windows instead of MSYS2's `kill`.

Untested scenarios (by risk):
1. **Orphaned hostagent recovery on Windows** — the test kills the service with `process_kill` (taskkill /F) and verifies the orphaned hostagent is detected after restart. But `process.Interrupt` (CTRL_BREAK) on an orphaned hostagent whose creating process has exited is not independently verified to work. If `GenerateConsoleCtrlEvent` doesn't reach the orphan, the test still passes because the 30s graceful timeout falls through to `stopInstanceForcibly`. A test that verifies the graceful path specifically (e.g., checking hostagent logs for clean shutdown messages) would distinguish graceful from forced termination.
2. **Concurrent template change during shutdown** — no test modifies the template ConfigMap while the service is shutting down. The `handleTemplateUpdate` deferred-version logic is only tested during normal operation.

---

## Documentation Assessment

No documentation gaps. The in-code comments are thorough — the WSL2 deadlock rationale (lines 47-51), PID zeroing rationale (lines 53-54), and `stopInstanceForcibly` design notes (lines 579-589) all provide sufficient context for future maintainers.

---

## Acknowledged Limitations

1. **Sequential hostagent shutdown** — `limavm_controller.go:505`:
   > `// TODO: Wait on all hostagents in parallel instead of sequentially; with the current loop, the total wait is N × gracefulShutdownTimeout in the worst case.`

   This PR doesn't change the sequential wait. With the new Windows support adding WSL2 cleanup per instance, the sequential cost increases slightly. The worst case (N hanging hostagents × 30s) is more likely on Windows where WSL subsystem degradation can prevent clean exits.

2. **Non-blocking stop** — `limavm_lifecycle.go:442-443`:
   > `// TODO: Non-blocking stop: send SIGINT and return immediately; the watcher detects the Exiting event and triggers a reconcile.`

   Pre-existing. The PR doesn't change this but adds `shutdownHostagent` which encapsulates the blocking stop logic, making a future non-blocking refactor easier.

---

## Commit Structure

The six commits form a clean logical progression: process signaling fix → WSL2 lifecycle fix → Lima test enablement → SSH key workaround → CI migration → logging propagation. Each commit is self-contained and its message accurately describes the changes. No fixup commits.

---

### Codex GPT 5.4

**Verdict**

Looks safe to merge.

### Executive Summary

This PR rewires Windows Lima shutdown handling around process groups, adds WSL2-specific forced cleanup, and moves Windows BATS execution from WSL to MSYS2. I traced the hostagent/service shutdown paths, the new `stopInstanceForcibly()` flow, Lima's upstream process-launching behavior, and the new SSH key/bootstrap logic, and I did not find a merge-blocking regression in the reviewed commits.

### Findings

**Critical Issues**

None.

**Important Issues**

None.

**Suggestions**

None.

### Design Observations

**Strengths**

The split between graceful shutdown (`process.Interrupt`) and forced cleanup (`stopInstanceForcibly`) is a good fit for the actual failure mode here. In particular, replacing Lima's Windows-wide `CTRL_BREAK_EVENT` behavior with per-process-group signaling avoids taking down the parent RDD process while still preserving graceful hostagent exits where possible.

The deletion path change in `limavm_lifecycle.go` is well targeted: force-terminating WSL before unregistering and then zeroing stale PIDs before calling Lima's `Delete()` closes the deadlock/recycled-PID hazards without changing the higher-level controller contract.

### Testing Assessment

1. The new Windows SSH-key bootstrap in `ssh_keys_windows.go` is only covered indirectly through integration startup. I did not find a focused test for partial on-disk states like `user` present but `user.pub` missing or stale `user.tmp*` files.
2. I did not find an automated case that exercises the forced-stop fallback where `process.Interrupt()` succeeds but the hostagent never reaches `Stopped`, so `shutdownHostagent()` falls through to `stopInstanceForcibly()`.
3. The MSYS2 workflow validates the happy path for Windows BATS, but I did not see a targeted test for the outer-timeout recovery path that kills `wslservice.exe`.

### Documentation Assessment

None.

### Acknowledged Limitations

- "`TODO: Wait on all hostagents in parallel instead of sequentially; with the current loop, the total wait is N × gracefulShutdownTimeout in the worst case.`" in `limavm_controller.go:504`. This limitation is unchanged by the PR; the new cleanup makes timeout handling better, but total shutdown latency still scales linearly with VM count.
- "`TODO: Non-blocking stop: send SIGINT and return immediately; the watcher detects the Exiting event and triggers a reconcile.`" in `limavm_lifecycle.go:211` and `limavm_lifecycle.go:442`. The PR keeps stop synchronous; the new forced-stop behavior reduces hang risk, but stop latency is still paid in-band by the reconciler.

---

### Gemini 3.1 Pro

### Executive Summary

This PR successfully introduces MSYS2-based BATS testing on Windows, cleanly isolates Windows hostagent processes using process groups, and implements targeted process termination. However, the custom termination logic skips cleanup of temporary files if the process is already dead, which can lead to orphaned instances getting permanently stuck in a broken state on Windows.

### Critical Issues

1. **`process.KillTree` is not idempotent, causing deadlocks during process cleanup** — `pkg/controllers/lima/limavm/controllers/limavm_lifecycle.go:590-599` `(critical, regression)`

```go
	allKilled := true
	for _, pid := range []int{inst.DriverPID, inst.HostAgentPID} {
		if pid > 0 {
			if err := process.KillTree(ctx, pid); err != nil {
				logger.V(1).Info("Failed to kill process tree", "pid", pid, "error", err)
				allKilled = false
			}
		}
	}
```

`stopInstanceForcibly` skips deleting temporary files (`ha.pid`, `ha.sock`) if `process.KillTree` returns an error (`allKilled = false` at line 596; checked at line 625). However, `KillTree` returns an error if the process has already exited (`syscall.ESRCH` on Unix, exit code 128 from `taskkill` on Windows).

If the hostagent crashes or exits before `stopInstanceForcibly` is called, `KillTree` fails, and the stale PID files are left behind. On Windows, Lima's `store.Inspect` does not auto-clean stale PID files (because `os.FindProcess` always succeeds in returning a handle), causing the instance to permanently report `StatusBroken`. The controller will then infinitely loop trying to stop it, permanently blocking the instance from recovering.

Fix: Make `process.KillTree` idempotent by returning `nil` when the process is already dead.

On Unix (`pkg/util/process/process_unix.go:37-39`):
```go
// KillTree sends SIGKILL to the process group led by the given PID.
// The process must have been started with SetGroup so it leads its own group.
func KillTree(_ context.Context, pid int) error {
	err := unix.Kill(-pid, unix.SIGKILL)
	if err == unix.ESRCH {
		return nil
	}
	return err
}
```

On Windows (`pkg/util/process/process_windows.go:61-63`):
```go
// KillTree terminates the process and all its descendants.
// Uses taskkill with /T (tree kill) and /F (force).
func KillTree(ctx context.Context, pid int) error {
	err := exec.CommandContext(ctx, "taskkill", "/F", "/T", "/PID", strconv.Itoa(pid)).Run()
	if err != nil {
		var exitErr *exec.ExitError
		// taskkill returns exit code 128 if the process is not found.
		if errors.As(err, &exitErr) && exitErr.ExitCode() == 128 {
			return nil
		}
		return err
	}
	return nil
}
```

### Important Issues
None.

### Suggestions
None.

### Design Observations

**Strengths**
- **PID recycling safety:** Setting `existingInst.DriverPID = 0` and `existingInst.HostAgentPID = 0` before calling `limainstance.Delete` on line 54 of `limavm_lifecycle.go` is an excellent fix. It prevents Lima's internal deletion logic from killing unrelated Windows processes if the OS had recycled those PIDs.
- **Robust file iteration:** Using `os.ReadDir` instead of `filepath.Glob` on line 631 of `limavm_lifecycle.go` to clean up temporary files correctly avoids silent failures when instance directories contain bracket characters (e.g., `C:\Users\name[1]`).
- **SSH key derivation:** Falling back to `ssh-keygen -y` to derive a missing public key from an existing private key in `ssh_keys_windows.go` smartly preserves SSH access to existing VMs rather than needlessly destroying the keypair.

### Testing Assessment
Test coverage is strong. The BATS tests intelligently account for potential WSL2 deadlocks (`limavm-running.bats:516-524`) and explicitly assert that orphaned hostagents can be recovered natively.

### Documentation Assessment
No missing documentation.

### Commit Structure
No issues found.

### Acknowledged Limitations
- **Sequential graceful shutdown:** "TODO: Wait on all hostagents in parallel instead of sequentially; with the current loop, the total wait is N × gracefulShutdownTimeout in the worst case." (`pkg/controllers/lima/limavm/controllers/limavm_controller.go:502-503`).