# Deep Review: 20260321-165509-pr-242

| | |
|---|---|
| **Date** | 2026-03-21 16:55 |
| **Author** | [@jandubois](https://github.com/jandubois) |
| **PR** | [#242](https://github.com/rancher-sandbox/rancher-desktop-daemon/pull/242) — Run Lima BATS tests on Windows via MSYS2 |
| **Branch** | `msys2-ci` |
| **Commits** | `76ef79d` Fix Windows process signaling for Lima hostagent lifecycle<br>`d9fbdb3` Fix stopInstanceForcibly to terminate WSL2 distro and clean up files<br>`297dac3` Enable Lima BATS tests on Windows<br>`c207006` Pre-generate Lima SSH keys on Windows to avoid broken path conversion<br>`7962cc8` Run Windows BATS tests via MSYS2 instead of WSL2<br>`6dea712` Honor RDD_LOG_LEVEL in external controllers |
| **Reviewers** | Claude Opus 4.6, Codex GPT 5.4, Gemini 3.1 Pro |
| **Verdict** | **Merge with fixes** — two important issues in forced-stop and timeout escalation paths |
| **Wall-clock time** | 11 min 40 s |

---

## Consolidated Review

### Executive Summary

This PR fixes Windows process signaling (isolating hostagents in their own process groups), hardens WSL2 cleanup (always terminating before unregistering to prevent `wslservice.exe` deadlocks), adds Windows-specific SSH key pre-generation, and migrates CI from WSL2-based to MSYS2-based BATS tests. The code is well-structured and the design decisions are sound. Two important issues warrant attention: unconditional PID file cleanup after failed process kills can mask a still-running instance, and the `StopWithWait` timeout fallback sends SIGTERM instead of SIGKILL on Unix.

### Critical Issues

None.

### Important Issues

1. **Unconditional PID file cleanup after failed process kill** — `pkg/controllers/lima/limavm/controllers/limavm_lifecycle.go:616-636` [Codex GPT 5.4] (important, regression)

```go
for _, pid := range []int{inst.DriverPID, inst.HostAgentPID} {
    if pid > 0 {
        if err := process.KillTree(ctx, pid); err != nil {
            logger.V(1).Info("Failed to kill process tree", "pid", pid, "error", err)
        }
    }
}
// ... later, unconditionally:
for _, f := range entries {
    for _, suffix := range filenames.TmpFileSuffixes {
        if strings.HasSuffix(f.Name(), suffix) {
            path := filepath.Join(inst.Dir, f.Name())
            _ = os.Remove(path)
```

`stopInstanceForcibly` logs but ignores `process.KillTree` failures at lines 588-591, then unconditionally deletes all `.pid`, `.sock`, and `.tmp` files at lines 624-635. Lima's `store.Inspect` derives `StatusStopped` from missing PID files (via `inspectStatusWithPIDFiles` in Lima's `pkg/store/instance.go`). If `KillTree` fails but PID files are removed, `store.Inspect` reports `StatusStopped` while the process is still alive.

**Mitigating factors:** On WSL2 (the primary target of this PR), Lima's WSL2 driver has its own `InspectStatus` that checks actual WSL state, bypassing the PID file fallback entirely. The VZ driver on macOS returns `""` and falls through to PID file inspection, where this bug applies. However, `KillTree` on Unix sends SIGKILL to the process group, which rarely fails for processes the controller spawned.

The most exposed caller is `killOrphanedHostagent` (line 550), which calls `stopInstanceForcibly` and returns without further verification. If `KillTree` fails here, the next reconcile sees `StatusStopped` and may start a new hostagent against a still-running one.

Fix: Track whether any kill failed and skip PID file cleanup on failure, so Lima continues to report the true state:

```diff
+	allKilled := true
 	for _, pid := range []int{inst.DriverPID, inst.HostAgentPID} {
 		if pid > 0 {
 			if err := process.KillTree(ctx, pid); err != nil {
 				logger.V(1).Info("Failed to kill process tree", "pid", pid, "error", err)
+				allKilled = false
 			}
 		}
 	}
 	// ...
+	if !allKilled {
+		logger.Info("Skipping tmp file cleanup because process kill failed")
+		return
+	}
 	entries, err := os.ReadDir(inst.Dir)
```

2. **StopWithWait timeout sends SIGTERM instead of SIGKILL on Unix** — `pkg/service/cmd/service.go:363` [Claude Opus 4.6] (important, regression)

```go
case <-timeout:
    // Graceful shutdown timed out; force-kill so we don't leave
    // a hung service process.
    _ = process.Kill(pid)
```

After 60 seconds waiting for graceful shutdown, `StopWithWait` calls `process.Kill(pid)`, which sends SIGTERM on Unix (line 32 of `process_unix.go`). The service already received SIGINT via `process.Interrupt` at line 345. Escalating to another catchable signal after a 60-second timeout does not force termination. On Windows this is not an issue because `process.Kill` uses `TerminateProcess`, which is unconditional. Git blame confirms `76ef79d` introduced this line (the pre-PR code had no kill-on-timeout at all).

Fix: Use `KillTree` to send SIGKILL on Unix and `taskkill /F /T` on Windows:

```diff
 			case <-timeout:
 				// Graceful shutdown timed out; force-kill so we don't leave
 				// a hung service process.
-				_ = process.Kill(pid)
+				_ = process.KillTree(context.Background(), pid)
```

---

### Suggestions

1. **waitForInstanceStopped aborts on any store.Inspect error** — `pkg/controllers/lima/limavm/controllers/limavm_lifecycle.go:565-566` [Gemini 3.1 Pro] (suggestion, regression)

```go
inst, err := store.Inspect(ctx, name)
if err != nil {
    return false
}
```

`waitForInstanceStopped` treats all `store.Inspect` errors as terminal, immediately returning `false` and triggering the forced-kill path. A transient error (e.g., the hostagent atomically rewriting a status file) causes unnecessary forced termination. In practice, `store.Inspect` errors on own instances are rare, and aborting early is conservative rather than dangerous, so this is low-risk.

Fix: Continue polling on transient errors instead of aborting:

```diff
 			inst, err := store.Inspect(ctx, name)
 			if err != nil {
-				return false
+				continue
 			}
```

2. **forceStop closure captures parent context for store.Inspect** — `pkg/controllers/lima/limavm/controllers/limavm_lifecycle.go:489` [Claude Opus 4.6] (suggestion, regression)

```go
forceStop := func() {
    if inst == nil {
        var err error
        inst, err = store.Inspect(ctx, name)
```

The `ctx` in `shutdownHostagent`'s `forceStop` closure is the reconciler context. If the context is nearing its deadline when `forceStop` runs (after a 30-second graceful timeout), `store.Inspect` fails and skips WSL2 cleanup entirely. Low risk because the reconciler context is typically long-lived and `shutdownAllHostagents` already demonstrates the correct pattern at line 522 (using `context.Background()`).

Fix: Use a bounded background context for the force-stop path:

```diff
 	forceStop := func() {
 		if inst == nil {
 			var err error
-			inst, err = store.Inspect(ctx, name)
+			forceCtx, forceCancel := context.WithTimeout(context.Background(), 30*time.Second)
+			defer forceCancel()
+			inst, err = store.Inspect(forceCtx, name)
```

---

### Design Observations

**Strengths**

- **Process group isolation** — Replacing console-wide `CTRL_BREAK` with targeted `CTRL_BREAK_EVENT` per process group eliminates the entire class of "stopping a VM kills the RDD service" bugs. The `process` package API (`SetGroup`, `Interrupt`, `Kill`, `KillTree`) provides clean cross-platform abstractions with minimal, correct implementations on each platform. [Claude Opus 4.6, Codex GPT 5.4, Gemini 3.1 Pro]

- **WSL2 deadlock prevention** — Always terminating before unregistering, wrapping `wsl.exe` calls with timeouts, and including a nuclear `wslservice.exe` kill in BATS cleanup reflects hard-won operational experience. The defensive layering means no single failure can block the test suite or the controller indefinitely. [Claude Opus 4.6]

- **Crash-safe SSH key generation** — `ensureSSHKeys` in `ssh_keys_windows.go` handles all partial-state combinations: temp file cleanup before generation, rename with public-key-first ordering, and fallback regeneration if the public key is missing. [Claude Opus 4.6, Codex GPT 5.4, Gemini 3.1 Pro]

- **VM template extraction** — Moving template YAML into `bats/helpers/vm_template.bash` with the `RDD_WSL_DISTRO` switch cleanly supports both finch and opensuse rootfs without duplicating templates across test files. [Claude Opus 4.6]

**Concerns**

1. **Stale PID risk in shutdownHostagent after graceful timeout** — When `shutdownHostagent` waits 30 seconds for graceful exit and then calls `forceStop()`, the `inst` object passed at function entry may be stale. If the hostagent exited moments before the timeout, the PID could theoretically be recycled and `KillTree` would target an unrelated process. The practical window is milliseconds (between hostagent exit near the timeout boundary and the `KillTree` call), and SIGKILL/`taskkill` rarely fail for own child processes, so the risk is very low. A fresh `store.Inspect` inside `forceStop` would close this window but adds a `store` dependency to what is currently a pure cleanup function. `(future)` [Gemini 3.1 Pro — originally rated CRITICAL; downgraded after verification that `handleDeletion` fetches a fresh instance at line 42 and `shutdownHostagent`'s window is milliseconds-small]

---

### Testing Assessment

The BATS tests adapt well to cross-platform execution with `assert_process_alive`/`process_kill` replacing Unix-only signals and `editor_cmd` handling MSYS2 path conversion.

Untested scenarios, ranked by risk:

1. **Failed KillTree followed by PID file cleanup** — No test exercises the path where `process.KillTree` fails but `stopInstanceForcibly` still removes PID files, creating a false `StatusStopped`. This is the highest-risk untested path.
2. **StopWithWait timeout escalation on Unix** — No test covers the 60-second timeout path in `StopWithWait` that now sends SIGTERM instead of SIGKILL.
3. **Concurrent `ensureSSHKeys` calls** — Two controller processes calling `ensureSSHKeys` simultaneously could race on the temp file path. Low risk given the single-instance deployment model.

---

### Documentation Assessment

Code comments are thorough throughout. The Windows signal model, WSL2 deadlock rationale, and MSYS2 path conversion workarounds are all well-documented. The PR description provides clear context for each commit. No documentation gaps.

---

### Commit Structure

The six commits form a clean progression:

1. `76ef79d` — Core Windows process signaling fix
2. `d9fbdb3` — WSL2 cleanup hardening (builds on #1)
3. `297dac3` — Enable Lima BATS on Windows (test infra)
4. `c207006` — SSH key pre-generation (MSYS2 path workaround)
5. `7962cc8` — CI: MSYS2 instead of WSL2
6. `6dea712` — Log level propagation (independent improvement)

Each commit is self-contained and the messages accurately describe the changes. No fixup commits.

---

### Acknowledged Limitations

- `pkg/controllers/lima/limavm/controllers/limavm_controller.go:504`: `"TODO: Wait on all hostagents in parallel instead of sequentially; with the current loop, the total wait is N × gracefulShutdownTimeout in the worst case."` — Pre-existing. This PR does not worsen it.

- `pkg/controllers/lima/limavm/controllers/limavm_lifecycle.go:442`: `"TODO: Non-blocking stop: send SIGINT and return immediately; the watcher detects the Exiting event and triggers a reconcile."` — Pre-existing. This PR improves the stop mechanics but keeps the reconciler blocked on shutdown completion.

---

### Unresolved Feedback

No review comments on this PR.

---

## Agent Performance Retro

### [Claude] Opus 4.6

**Duration:** 3 min 37 s

Claude produced the most findings (3) and was the only agent to identify the `StopWithWait` timeout escalation issue and the `forceStop` context concern. Its analysis was methodical: it traced through callers, verified findings with git blame, and correctly confirmed that the `defer cancel()` in `stopInstanceForcibly` is not a bug. One finding (#3 in the original) was purely confirmatory analysis rather than an actionable issue. No false positives.

Claude identified five design strengths with specific code references and clear reasoning. Its testing assessment was accurate, including correctly dismissing the PID recycling concern in `waitForProcessExit` after tracing the implementation.

### [Codex] GPT 5.4

**Duration:** 4 min 22 s

Codex produced a single finding but it was the most impactful: unconditional PID file cleanup masking failed kills. The analysis was precise — it traced through Lima's `store.Inspect` → `inspectStatusWithPIDFiles` to demonstrate that missing PID files cause false `StatusStopped` reports. The suggested fix (conditional cleanup on verified process termination) is concrete and actionable.

The review was lean — one finding, two design strengths, three testing gaps. No false positives. The signal-to-noise ratio was excellent.

### [Gemini] 3.1 Pro

**Duration:** 4 min 58 s

Gemini's CRITICAL finding (stale PID recycling killing unrelated processes) was overblown. The claimed scenario in `handleDeletion` is incorrect: the function fetches a fresh instance via `store.Inspect` at line 42, immediately before calling `stopInstanceForcibly` at line 52. The broader PID recycling concern in `shutdownHostagent` is theoretically valid but requires SIGKILL failure combined with exact PID recycling within milliseconds — a scenario unlikely enough to warrant IMPORTANT at most. The suggested fix (re-fetching inside `stopInstanceForcibly` and adding a status check) would add complexity for negligible benefit and introduces a `store` dependency into a pure cleanup function.

Gemini's IMPORTANT finding about `waitForInstanceStopped` aborting on transient errors is valid but overstated at IMPORTANT severity. The practical risk is low since `store.Inspect` errors on own instances are rare.

Gemini had one false positive (the CRITICAL) and one severity inflation (the IMPORTANT). Its design strengths section was accurate.

### Summary

| Metric | Claude Opus 4.6 | Codex GPT 5.4 | Gemini 3.1 Pro |
|---|---|---|---|
| Duration | 3:37 | 4:22 | 4:58 |
| Critical | 0 | 0 | 1 |
| Important | 1 | 1 | 1 |
| Suggestion | 2 | 0 | 0 |
| Design observations | 5 | 2 | 3 |
| False positives | 0 | 0 | 1 |
| Unique insights | 2 | 1 | 1 |

Claude provided the broadest coverage with the highest signal-to-noise ratio and no false positives. Codex was the most efficient — one finding, but it was the most impactful in the review. Gemini's strength was probing persistent-state lifecycle paths, but its top finding was a false positive after verification. The multi-agent approach paid off: Claude's `StopWithWait` finding and Codex's PID file cleanup finding were each caught by only one agent.

---

## Appendix: Original Reviews

### Claude Opus 4.6

## PR #242: Run Lima BATS tests on Windows via MSYS2

**Verdict: Approve with suggestions**

### Executive Summary

This PR makes Lima VM lifecycle management work correctly on Windows by fixing process signaling (replacing console-wide break events with targeted process group signals), hardening WSL2 cleanup (always terminating before unregistering to prevent `wslservice.exe` deadlocks), adding Windows-specific SSH key pre-generation, and switching CI from WSL2-based to MSYS2-based BATS tests. The code is well-structured and the approach is sound; the findings below are all SUGGESTION-level.

### Findings

#### Critical Issues

None.

#### Important Issues

None.

#### Suggestions

1. **`StopWithWait` timeout fallback sends SIGTERM, not SIGKILL** — `pkg/service/cmd/service.go:361-363` (important, gap)

```go
			case <-timeout:
				// Graceful shutdown timed out; force-kill so we don't leave
				// a hung service process.
				_ = process.Kill(pid)
```

On Unix, `process.Kill` sends SIGTERM (line 32 of `process_unix.go`), which the service process may ignore or handle slowly — the same signal was already sent via `process.Interrupt` (SIGINT) at line 345. After a 60-second wait, the escalation should be to SIGKILL. On Windows this is not an issue since `process.Kill` uses `TerminateProcess` which is unconditional. Git blame confirms `76ef79d` introduced this line.

Fix: Use `process.KillTree` for the timeout fallback, matching the pattern in `shutdownAllHostagents` (line 515 of `limavm_controller.go`):

```diff
 			case <-timeout:
 				// Graceful shutdown timed out; force-kill so we don't leave
 				// a hung service process.
-				_ = process.Kill(pid)
+				_ = process.KillTree(context.Background(), pid)
```

2. **`shutdownHostagent` uses parent context for `store.Inspect` in the `forceStop` closure** — `pkg/controllers/lima/limavm/controllers/limavm_lifecycle.go:489` (suggestion, regression)

```go
	forceStop := func() {
		if inst == nil {
			var err error
			inst, err = store.Inspect(ctx, name)
```

The `ctx` here is the reconciler context. If `shutdownHostagent` is called from `handleRestartNeeded` (line 292) and the context is nearing its deadline, `store.Inspect` could fail with a context error, skipping WSL2 cleanup and tmp file removal. This is low risk because the reconciler context is typically long-lived, but the `shutdownAllHostagents` path (lines 520-527) already demonstrates the correct pattern of using a fresh `context.Background()` when the parent context might be cancelled.

Fix: Consider using a background context with a bounded timeout for the force-stop path, similar to `waitAfterKill`:

```diff
 	forceStop := func() {
 		if inst == nil {
 			var err error
-			inst, err = store.Inspect(ctx, name)
+			forceCtx, forceCancel := context.WithTimeout(context.Background(), 30*time.Second)
+			defer forceCancel()
+			inst, err = store.Inspect(forceCtx, name)
```

3. **`wsl.exe --terminate` deferred cancel in a loop** — `pkg/controllers/lima/limavm/controllers/limavm_lifecycle.go:611`

```go
		wslCtx, cancel := context.WithTimeout(ctx, 10*time.Second)
		defer cancel()
```

`stopInstanceForcibly` is called from a loop in `shutdownAllHostagents` (line 525), from `shutdownHostagent`'s `forceStop` closure, and from `handleDeletion`. The `defer cancel()` is fine since it runs when `stopInstanceForcibly` returns, but it's worth noting that if the parent `ctx` is already cancelled (which it is in `shutdownAllHostagents`), the 10-second timeout is anchored to an already-done context. In `shutdownAllHostagents` the caller already wraps it with `context.Background()` at line 522, so this works correctly. Not a bug — just confirming the analysis.

### Design Observations

**Strengths**

- The decision to replace `limainstance.StopForcibly` with a custom `stopInstanceForcibly` is well-motivated. Lima's `SysKill` on Windows uses `GenerateConsoleCtrlEvent(CTRL_BREAK)` without targeting a specific process group, which would kill the RDD service. The replacement uses `process.KillTree` (`taskkill /F /T`) which correctly targets the process tree. The comment at lines 575-582 explains the "why" clearly.

- The WSL2 deadlock prevention strategy is thorough: always terminate before unregister, wrap `wsl.exe` calls with timeouts, and include a nuclear recovery option in the BATS cleanup test (killing `wslservice.exe`). This defensive layering reflects hard-won operational experience.

- The `process` package API (`SetGroup`, `Interrupt`, `Kill`, `KillTree`) provides clean cross-platform abstractions with well-documented platform semantics. The Unix and Windows implementations are each minimal and correct.

- The SSH key pre-generation on Windows (`ssh_keys_windows.go`) handles all the crash-safety cases correctly: temp file cleanup before generation, atomic rename with public-key-first ordering (so a partial failure leaves no orphaned private key), and fallback to regeneration if the public key is missing.

- Extracting VM templates into `bats/helpers/vm_template.bash` with the `RDD_WSL_DISTRO` switch is a clean way to support both finch and opensuse rootfs without duplicating template YAML across test files.

### Testing Assessment

Test coverage is good. The existing BATS tests have been adapted to work cross-platform with `assert_process_alive` and `process_kill` replacing Unix-only `kill -0` and `kill -9`. The `editor_cmd` wrapper handles MSYS2 path conversion.

Untested scenarios (ranked by risk):

1. **Concurrent `ensureSSHKeys` calls**: Two controller processes calling `ensureSSHKeys` simultaneously could race on the same temp file path (`privPath + ".tmp"`). In practice this is unlikely — only one lima-controller instance runs at a time — but the function has no file-locking or unique temp path. Low risk given the single-instance deployment model.

2. **PID recycling between `KillTree` and `waitForProcessExit`**: On Windows, PIDs recycle quickly. After `process.KillTree` terminates a hostagent, the `waitForProcessExit` at line 507 waits on `state.procExited` (closed by `cmd.Wait()`), not on the PID, so this is safe. No issue here.

### Documentation Assessment

The comments throughout the new code are thorough and explain Windows-specific gotchas well. The PR description provides good context on each commit. No documentation gaps identified.

### Commit Structure

The six commits are well-organized, each representing a self-contained concept:

1. `76ef79d` — Process signaling fix (core Windows fix)
2. `d9fbdb3` — WSL2 cleanup hardening (builds on #1)
3. `297dac3` — Enable Lima BATS on Windows (test infra)
4. `c207006` — SSH key pre-generation (workaround for Lima/MSYS2 path issue)
5. `7962cc8` — CI: MSYS2 instead of WSL2 (CI infra)
6. `6dea712` — Log level propagation (independent improvement)

Clean progression from core fix → hardening → test enablement → CI. No fixup commits.

### Acknowledged Limitations

- **Sequential hostagent shutdown** — `limavm_controller.go:504-505`: `// TODO: Wait on all hostagents in parallel instead of sequentially; with the current loop, the total wait is N × gracefulShutdownTimeout in the worst case.` Pre-existing and acknowledged in MEMORY.md. This PR doesn't make it worse.

- **Non-blocking stop** — `limavm_lifecycle.go:442-443`: `// TODO: Non-blocking stop: send SIGINT and return immediately; the watcher detects the Exiting event and triggers a reconcile.` Pre-existing; this PR doesn't change the blocking stop behavior but does refactor it into `shutdownHostagent`.

### Unresolved Feedback

No review comments on this PR.

---

### Codex GPT 5.4

### Executive Summary

This PR meaningfully improves Windows Lima lifecycle handling: hostagents now run in isolated process groups, the controller force-stops WSL2 distros before unregistering them, and CI moves Windows BATS to MSYS2. I found one important regression in the new forced-stop path: it can report a VM as stopped after kill failures by deleting the PID files that Lima uses as its source of truth.

### Findings

**Critical Issues**

None.

**Important Issues**

1. **Forced-stop can hide failed kills by deleting Lima's PID files anyway** — `pkg/controllers/lima/limavm/controllers/limavm_lifecycle.go:586-635` (important, regression)

```go
for _, pid := range []int{inst.DriverPID, inst.HostAgentPID} {
	if pid > 0 {
		if err := process.KillTree(ctx, pid); err != nil {
			logger.V(1).Info("Failed to kill process tree", "pid", pid, "error", err)
		}
	}
}
```

```go
if strings.HasSuffix(f.Name(), suffix) {
	path := filepath.Join(inst.Dir, f.Name())
	if err := os.Remove(path); err != nil {
		logger.V(1).Info("Failed to remove tmp file", "path", path, "error", err)
	}
}
```

`stopInstanceForcibly()` ignores `process.KillTree()` failures at lines 587-590, then unconditionally removes every `.pid`/`.sock`/`.tmp` file under the instance directory at lines 616-633. That matters because Lima's `store.Inspect()` derives `StatusStopped` from missing PID files in `.../pkg/store/instance.go` at lines 169-181, so `waitForInstanceStopped()` at line 564 and `stopInstance()` at lines 452-471 can now conclude the VM is stopped even when the driver or hostagent is still alive. In practice, one failed `taskkill` / `kill(-pgid)` leaves orphaned VM processes running while the controller records a clean stop and allows the next start/delete path to proceed against inconsistent state.

Fix: make PID/socket cleanup conditional on verified process termination. If `process.KillTree()` fails at line 589, or the PID still exists after a bounded recheck, return an error and leave the PID files in place so `store.Inspect()` continues to reflect reality. If you still need stale-file cleanup for Windows restartability, scope it to files whose owning process is confirmed dead rather than deleting all Lima temp files unconditionally.

**Suggestions**

None.

### Design Observations

**Strengths**

- Isolating hostagents into their own process groups via `pkg/util/process/process_windows.go` and `pkg/controllers/lima/limavm/controllers/limavm_lifecycle.go` is the right fix for the original Windows console-signal blast radius; it removes a whole class of "service kills itself while stopping a VM" failures.
- Pre-generating Lima SSH keys in `pkg/controllers/lima/limavm/controllers/ssh_keys_windows.go` is a pragmatic workaround for the MSYS2/native-path mismatch, and the temp-file-plus-rename approach is a good crash-safety improvement.

### Testing Assessment

1. No test exercises the failure branch where `process.KillTree()` / `taskkill` fails but `stopInstanceForcibly()` still removes PID files; that is the highest-risk path in this PR because it can create false `Stopped` states.
2. I did not find coverage for forced-stop behavior on Unix/macOS after the switch from Lima's per-PID `SysKill` to the new tree-kill helper, especially around restart/delete flows.
3. I did not find a test for the fallback path where Windows `GenerateConsoleCtrlEvent` fails and the controller drops into direct kill + forced cleanup.

### Documentation Assessment

No user-facing documentation gaps stood out. Code comments around the Windows signal model and WSL2 deadlock rationale are strong.

### Acknowledged Limitations

- `pkg/controllers/lima/limavm/controllers/limavm_controller.go:504`: `"TODO: Wait on all hostagents in parallel instead of sequentially"` — The change does not worsen correctness, but it still leaves shutdown latency proportional to the number of VMs.
- `pkg/controllers/lima/limavm/controllers/limavm_lifecycle.go:442`: `"TODO: Non-blocking stop: send SIGINT and return immediately"` — Still accurate; this PR improves the stop mechanics but keeps reconcile blocked on shutdown completion.

---

### Gemini 3.1 Pro

### Executive Summary

This PR robustly improves process management and CI stability on Windows. By isolating hostagents in their own process groups, it resolves a severe historical bug where stopping a VM would broadcast `CTRL_BREAK_EVENT` to the entire console, killing the background RDD service. Switching BATS tests to MSYS2 is also a clean and effective workaround for WSL interop path conversion issues. However, the migration of forced process termination logic introduces a critical regression around PID recycling on Windows that risks terminating unrelated processes.

### Findings

#### Critical Issues

1. **Unconditional forced termination of stale PIDs kills unrelated processes** — `pkg/controllers/lima/limavm/controllers/limavm_lifecycle.go:586-593` (critical, regression)

```go
func stopInstanceForcibly(ctx context.Context, logger logr.Logger, inst *limatype.Instance) {
	for _, pid := range []int{inst.DriverPID, inst.HostAgentPID} {
		if pid > 0 {
			if err := process.KillTree(ctx, pid); err != nil {
				logger.V(1).Info("Failed to kill process tree", "pid", pid, "error", err)
			}
		}
	}
```

The replacement of `limainstance.StopForcibly` with `stopInstanceForcibly` introduces a severe process-killing bug. In `handleDeletion`, `stopInstanceForcibly` is called unconditionally on `existingInst` to ensure WSL2 distros are terminated. The author correctly identified that PIDs can be recycled rapidly on Windows—adding `existingInst.DriverPID = 0` to prevent Lima's `Delete` from killing them—but `stopInstanceForcibly` is called *before* this safeguard and blindly acts on the exact same stale PIDs via `process.KillTree`.

Furthermore, callers like `shutdownHostagent` and `killOrphanedHostagent` pass an `inst` object that was fetched up to 30 seconds prior (before waiting for graceful shutdown). If the hostagent exited gracefully just as the timeout expired, the passed `inst` is stale, and `stopInstanceForcibly` will kill whatever unrelated process tree currently holds those recycled PIDs.

Fix: Update `stopInstanceForcibly` to always fetch a fresh instance state to guarantee PIDs are current, and explicitly skip process termination if the instance is not actively running.

```diff
 func stopInstanceForcibly(ctx context.Context, logger logr.Logger, inst *limatype.Instance) {
+	// Always get the latest state before killing processes to avoid acting on stale PIDs
+	// if the process exited while waiting (e.g., during graceful shutdown).
+	if currentInst, err := store.Inspect(ctx, inst.Name); err == nil && currentInst != nil {
+		inst = currentInst
+	}
+
+	// Only attempt to kill processes if Lima still considers the instance running/broken.
+	// If it is stopped, any non-zero PIDs are stale and killing them terminates unrelated processes.
+	if inst.Status != limatype.StatusStopped {
 		for _, pid := range []int{inst.DriverPID, inst.HostAgentPID} {
 			if pid > 0 {
 				if err := process.KillTree(ctx, pid); err != nil {
 					logger.V(1).Info("Failed to kill process tree", "pid", pid, "error", err)
 				}
 			}
 		}
+	}
 	// Unlock additional disks so they can be reused by other instances.
```

#### Important Issues

1. **Graceful shutdown poll aborts immediately on transient read errors** — `pkg/controllers/lima/limavm/controllers/limavm_lifecycle.go:558-561` (important, gap)

```go
			inst, err := store.Inspect(ctx, name)
			if err != nil {
				return false
			}
```

If `store.Inspect` encounters a transient read error (e.g., if Lima is atomically rewriting the `lima.yaml` status file at that exact millisecond) or if the instance was successfully deleted while polling (`os.ErrNotExist`), `waitForInstanceStopped` immediately aborts the graceful shutdown poll and triggers a forced kill sequence.

Fix: Log or ignore the error and continue polling until the context expires, treating `os.ErrNotExist` as stopped.

```diff
 			inst, err := store.Inspect(ctx, name)
 			if err != nil {
+				if errors.Is(err, os.ErrNotExist) {
+					return true
+				}
-				return false
+				continue // Transient read error, keep polling
 			}
```

#### Suggestions

None.

### Design Observations

**Strengths**:
- **Windows Process Group Isolation:** Utilizing `CREATE_NEW_PROCESS_GROUP` via `SetGroup` and targeting `CTRL_BREAK_EVENT` directly at the child's process group ID is an elegant and robust solution. It perfectly fixes the long-standing bug where stopping a VM would crash the RDD service.
- **Crash-Resistant File Operations:** In `ensureSSHKeys` (`ssh_keys_windows.go`), the explicit cleanup of `.tmp` and `.tmp.pub` files prevents interrupted `ssh-keygen` runs from blocking future attempts with interactive prompts. The handling of partial state (e.g., if the `.pub` file rename succeeds but the private key rename fails and crashes) ensures the controller never gets permanently wedged.
- **Log Level Fallback:** `setVerbosityFromEnv` correctly handles unsupported log levels by falling through to the default klog verbosity with a clear warning, rather than silently ignoring user intent.

### Testing Assessment

The migration to MSYS2 is well-covered and functionally verified by the test suites. The tests robustly handle Windows-specific quirks, such as checking processes via `tasklist.exe` instead of `kill -0`.

1. **Untested Scenario:** The race condition where a stale `inst` object is passed to `stopInstanceForcibly` causing random PID kills is difficult to test deterministically via BATS, making the suggested inline code fix crucial.

### Documentation Assessment

No missing documentation updates.

### Acknowledged Limitations

- "Wait on all hostagents in parallel instead of sequentially; with the current loop, the total wait is N × gracefulShutdownTimeout in the worst case." (`pkg/controllers/lima/limavm/controllers/limavm_controller.go:504`). The sequential loop makes shutdown linearly slower for each running VM.
- `wsl.exe --unregister` can deadlock `wslservice.exe`. (`pkg/controllers/lima/limavm/controllers/limavm_lifecycle.go:48`). The addition of `wsl.exe --terminate` before unregistering mitigates this, but it is an acknowledged underlying WSL subsystem bug.
- BATS `rdd ctl wait --for=delete` may hang indefinitely if the reconciler is blocked by WSL. (`bats/tests/33-lima-controllers/limavm-running.bats:516`). The test suite explicitly mitigates this by applying an outer timeout and force-killing `wslservice.exe` to recover the runner.
