# Deep Review: 20260322-142409-pr-242

| | |
|---|---|
| **Date** | 2026-03-22 14:24 |
| **Author** | [@jandubois](https://github.com/jandubois) |
| **PR** | [#242](https://github.com/rancher-sandbox/rancher-desktop-daemon/pull/242) — Run Lima BATS tests on Windows via MSYS2 |
| **Branch** | `msys2-ci` |
| **Commits** | `99f1148` Fix Windows process signaling for Lima hostagent lifecycle<br>`db225a5` Fix stopInstanceForcibly to terminate WSL2 distro and clean up files<br>`03546e4` Enable Lima BATS tests on Windows<br>`498d0c0` Pre-generate Lima SSH keys on Windows to avoid broken path conversion<br>`7e0e21e` Run Windows BATS tests via MSYS2 instead of WSL2<br>`e201440` Honor RDD_LOG_LEVEL in external controllers |
| **Reviewers** | Claude Opus 4.6, Codex GPT 5.4, Gemini 3.1 Pro |
| **Verdict** | **Merge with fixes** — two process-lifecycle issues on Windows need attention before merge |
| **Wall-clock time** | 18 min 21 s |

---

## Consolidated Review

### Executive Summary

This PR fixes Windows process signaling by placing hostagents in their own process groups (`CREATE_NEW_PROCESS_GROUP`), adds WSL2 deadlock prevention (`wsl --terminate` before `--unregister`), pre-generates SSH keys to sidestep MSYS2 path conversion, and switches CI from WSL2-based to MSYS2-based BATS. The core design is sound: each fix targets a real Windows deficiency. Two important issues remain: `stopInstanceForcibly` can kill recycled PIDs when called on stopped instances, and `killHostagent` only kills the parent process, leaving children alive on Windows.

---

### Critical Issues

None.

---

### Important Issues

1. **Stopped-instance force-stop can kill recycled PIDs on Windows** — `pkg/controllers/lima/limavm/controllers/limavm_lifecycle.go:47-56` (important, regression) [Codex GPT 5.4]

```go
// Always force-stop before deleting, even if the instance appears
// stopped.
stopInstanceForcibly(ctx, logger, existingInst)
// Clear stale PIDs so Lima's Delete → StopForcibly does not kill
// unrelated processes if the PIDs were recycled (likely on Windows).
existingInst.DriverPID = 0
existingInst.HostAgentPID = 0
```

The `deleteInstance` path calls `stopInstanceForcibly` at line 52 before zeroing the PID fields at lines 55–56. For a stopped instance on Windows, Lima's `ReadPIDFile` returns a non-zero PID if the PID file exists and the OS recycled that PID to a new process (`store/instance.go:210-211` returns early for any running Windows process). `stopInstanceForcibly` then calls `process.KillTree` at line 598 with this recycled PID, killing an unrelated process tree. The zeroing at lines 55–56 prevents Lima's subsequent `Delete → StopForcibly` from repeating the damage but does not prevent the first kill.

On Unix this cannot happen: `ReadPIDFile` sends signal 0 to verify ownership and removes stale PID files for dead processes (`store/instance.go:213-217`). On Windows, `os.FindProcess` succeeds for any running PID regardless of ownership.

The intent of the unconditional `stopInstanceForcibly` call is to run `wsl --terminate` before `--unregister`. PID-based killing is unnecessary for stopped instances.

Fix: split WSL2-specific teardown from PID-based killing for stopped instances:

```diff
- stopInstanceForcibly(ctx, logger, existingInst)
+ if existingInst.Status == limatype.StatusRunning || existingInst.Status == limatype.StatusBroken {
+     stopInstanceForcibly(ctx, logger, existingInst)
+ } else if existingInst.VMType == limatype.WSL2 {
+     // A "stopped" WSL2 distro can retain kernel state that deadlocks
+     // wsl.exe --unregister. Terminate it without PID-based killing,
+     // since the PIDs may have been recycled on Windows.
+     wslCtx, cancel := context.WithTimeout(ctx, 10*time.Second)
+     defer cancel()
+     distroName := "lima-" + existingInst.Name
+     _ = exec.CommandContext(wslCtx, "wsl.exe", "--terminate", distroName).Run()
+ }
  existingInst.DriverPID = 0
  existingInst.HostAgentPID = 0
```

2. **`killHostagent` only terminates the parent, leaving children alive on Windows** — `pkg/controllers/lima/limavm/controllers/hostagent_watcher.go:215-222` (important, regression) [Gemini 3.1 Pro]

```go
func (r *LimaVMReconciler) killHostagent(name string) {
	r.instanceStatesMu.RLock()
	state, ok := r.instanceStates[name]
	r.instanceStatesMu.RUnlock()
	if !ok || state.cmd == nil || state.cmd.Process == nil {
		return
	}
	_ = state.cmd.Process.Kill()
}
```

`Process.Kill()` at line 222 sends `TerminateProcess` on Windows, which kills only the parent hostagent. Child processes (SSH port forwarders) inherit the parent's stdout/stderr pipes and keep them open. The subsequent `forceStop()` at `limavm_lifecycle.go:529` calls `process.KillTree` with the inst's `HostAgentPID`, but on Windows `taskkill /F /T /PID` returns exit code 128 (process not found) for the already-dead parent and does not traverse to its children. The children survive, `cmd.Wait()` blocks on the held pipes, and the reaper goroutine at `hostagent_watcher.go:83-91` leaks.

On Unix this self-heals: `KillTree` sends `SIGKILL` to the process group (`kill(-pid, SIGKILL)`), which kills all members regardless of the leader's state.

Fix: replace `Process.Kill()` with `process.KillTree` to kill the entire process group while the parent is still alive:

```diff
 func (r *LimaVMReconciler) killHostagent(name string) {
 	r.instanceStatesMu.RLock()
 	state, ok := r.instanceStates[name]
 	r.instanceStatesMu.RUnlock()
 	if !ok || state.cmd == nil || state.cmd.Process == nil {
 		return
 	}
-	_ = state.cmd.Process.Kill()
+	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
+	defer cancel()
+	_ = process.KillTree(ctx, state.cmd.Process.Pid)
 }
```

3. **Disk unlock ignores failed process kills** — `pkg/controllers/lima/limavm/controllers/limavm_lifecycle.go:604-613` (important, gap) [Codex GPT 5.4]

```go
// Unlock additional disks so they can be reused by other instances.
for _, d := range inst.AdditionalDisks {
	disk, err := store.InspectDisk(d.Name)
	if err != nil {
		logger.V(1).Info("Disk does not exist", "disk", d.Name)
		continue
	}
	if err := disk.Unlock(); err != nil {
		logger.V(1).Info("Failed to unlock disk", "disk", d.Name, "error", err)
	}
}
```

`stopInstanceForcibly` gates PID/socket file cleanup on `allKilled` at lines 629–631 but unlocks additional disks at lines 604–613 unconditionally. If `KillTree` failed (the VM driver is still alive), another instance can now lock and attach the same disk while the original VM is still using it. The PID cleanup correctly avoids this pattern; the disk unlock should follow the same guard.

Fix:

```diff
- // Unlock additional disks so they can be reused by other instances.
- for _, d := range inst.AdditionalDisks {
+ if allKilled {
+     // Unlock additional disks only after the instance is confirmed gone.
+     for _, d := range inst.AdditionalDisks {
```

---

### Suggestions

1. **Cross-console `process.Interrupt` fails silently on Windows** — `pkg/service/cmd/service.go:345` (suggestion, gap) [Gemini 3.1 Pro]

```go
if err := process.Interrupt(pid); err != nil {
	if err := process.Kill(pid); err != nil {
		return fmt.Errorf("failed to stop %q control plane with pid %d: %w", instance.Name(), pid, err)
	}
}
```

`GenerateConsoleCtrlEvent` delivers the signal only if the caller shares a console with the target. If invoked from a different terminal or if the daemon is detached, the call fails and the code falls back to `process.Kill` (`TerminateProcess`), which bypasses the graceful shutdown path (`shutdownAllHostagents`). Hostagents survive because they run in separate process groups, and self-heal on the next startup. A comment documenting this limitation and the self-healing behavior would prevent future confusion.

Fix: add a comment above line 345 explaining the cross-console limitation.

2. **`KillTree` semantics differ between Unix and Windows** — `pkg/util/process/process_unix.go:39`, `pkg/util/process/process_windows.go:66` (suggestion, gap) [Gemini 3.1 Pro]

On Unix, `KillTree` sends `SIGKILL` to the process group; children started with `SetGroup` (their own group) are not affected. On Windows, `taskkill /T` walks the parent-child tree regardless of process groups and kills all descendants. This difference is harmless in the current codebase because `shutdownAllHostagents` calls `KillTree` directly on each hostagent's PID, but the abstraction leak should be documented on the `KillTree` function.

3. **`ensureSSHKeysAt` runs `ssh-keygen` without a timeout** — `pkg/controllers/lima/limavm/controllers/ssh_keys.go:47-53` (suggestion, regression) [Claude Opus 4.6]

```go
cmd := exec.CommandContext(ctx, "ssh-keygen", "-t", "ed25519", "-q", "-N", "",
	"-C", "lima", "-f", tmpPath)
```

The `ctx` originates from `context.Background()` at `limavm_controller.go:453`. If `ssh-keygen` hangs (e.g., a misconfigured hardware security module or MSYS2 path issue), controller setup blocks indefinitely. A 30-second timeout would bound this.

4. **`setVerbosityFromEnv` mapping undocumented for operators** — `pkg/external/main.go:257-278` (suggestion, regression) [Claude Opus 4.6]

The mapping from `RDD_LOG_LEVEL` values to klog verbosity levels (`trace`→4, `debug`→2, `info`→0) exists only in source code. Operators setting `RDD_LOG_LEVEL=debug` for the service (which uses logrus) may not realize that external controllers interpret it as klog verbosity. A brief comment in the CI workflow or the CLAUDE.md logging section would clarify this.

---

### Design Observations

**Concerns**

- **`exec.CommandContext` races with graceful shutdown** (future) [Claude Opus 4.6] — `startInstance` at `limavm_lifecycle.go:397` creates the hostagent command with `exec.CommandContext(ctx, ...)` where `ctx` is the reconciler context. When the manager shuts down and cancels this context, Go sends `os.Kill` (SIGKILL/TerminateProcess) immediately, racing with `shutdownAllHostagents` which attempts graceful SIGINT first. If SIGKILL arrives first, the hostagent dies without running its cleanup. `shutdownAllHostagents` then sees `procExited` closed, sets `graceful = true`, and skips `stopInstanceForcibly`. The orphaned resources self-heal on the next startup via `killOrphanedHostagent`. Fix: override `cmd.Cancel` after `cmd.Start()` to use `process.Interrupt` instead of the default `os.Kill`, aligning context cancellation with the graceful shutdown design.

**Strengths**

- `stopInstanceForcibly` cleanly enumerates four cleanup concerns (process tree, disk locks, WSL2 distro, temp files) and correctly gates file cleanup on kill success — removing PID files for a still-running process would mask the problem from `store.Inspect`. [Claude Opus 4.6, Codex GPT 5.4]
- The `ensureSSHKeysAt` function handles all six combinations of partial on-disk state (both exist, private-only, public-only, neither, corrupt, stale temps) with correct crash-recovery semantics: generate to `.tmp`, rename public first, then private, with cleanup on failure at each step. [Claude Opus 4.6]
- Moving hostagents into their own process group and routing graceful shutdown through `process.Interrupt` is the right fix for the console-wide `CTRL_BREAK_EVENT` problem. [Codex GPT 5.4, Gemini 3.1 Pro]
- The WSL2-specific `wsl --terminate` before `--unregister` addresses the actual deadlock source instead of hardening only the test harness. [Codex GPT 5.4, Gemini 3.1 Pro]
- `shutdownHostagent` consolidates three inline shutdown sequences into a single function with explicit timeouts at every stage, bounding worst-case blocking to ~70s. [Claude Opus 4.6]

---

### Testing Assessment

Test coverage is solid for the main paths. The SSH key generation has thorough unit tests covering all partial-state scenarios. The BATS integration tests cover the full VM lifecycle including crash recovery and orphan recovery. Untested scenarios, ranked by risk:

1. **Force-stop fallback in `shutdownHostagent`** — acknowledged at `limavm_lifecycle.go:518-520`. Requires a hostagent that ignores shutdown signals.
2. **Windows deletion of a stopped/crashed WSL2 instance with stale PID files** — the new path at `limavm_lifecycle.go:47-56` (finding 1 above).
3. **Partial-failure branch in `stopInstanceForcibly`** — when one `KillTree` call fails but cleanup continues (finding 3 above).
4. **`ensureSSHKeys` on Windows** — unit tests cover `ensureSSHKeysAt`, but the platform-specific wrapper that resolves the Lima config directory requires a Windows environment.

---

### Documentation Assessment

In-code rationale for the WSL2 and process-group changes is thorough. The `setVerbosityFromEnv` mapping (suggestion 4) would benefit from a note in the CI workflow or CLAUDE.md logging section.

---

### Commit Structure

The six commits form a clean progression: process signaling fix → force-stop + WSL2 cleanup → Windows BATS enablement → SSH key pre-generation → CI workflow switch → log-level propagation. Each commit represents a single concept and its message accurately describes the change.

The PR bundles five distinct concerns (process signaling, WSL2 deadlock, SSH keys, CI workflow, log-level propagation), but they are interdependent: the BATS tests cannot pass on Windows without all five fixes in place. Splitting them would create un-testable intermediate states.

---

### Acknowledged Limitations

1. **Sequential hostagent shutdown** — `limavm_controller.go:504-505`: `// TODO: Wait on all hostagents in parallel instead of sequentially; with the current loop, the total wait is N × gracefulShutdownTimeout in the worst case.` — The WSL2 termination step adds ~10s to the per-instance force-stop path. N is typically 1 (single VM per instance), so the practical impact is minor. [Claude Opus 4.6, Codex GPT 5.4, Gemini 3.1 Pro]

2. **Non-blocking stop** — `limavm_lifecycle.go:442-443`: `// TODO: Non-blocking stop: send SIGINT and return immediately; the watcher detects the Exiting event and triggers a reconcile.` — Unchanged by this PR. `shutdownHostagent` blocks the reconciler during stop, consistent with the pre-existing design. [Claude Opus 4.6]

3. **Force-stop test gap** — `limavm_lifecycle.go:518-520`: `// Not tested: the forceStop fallback requires a hostagent that ignores shutdown signals.` — Explicitly documented. The risky cleanup branches (findings 2 and 3) live in this untested path. [Claude Opus 4.6, Codex GPT 5.4]

---

## Agent Performance Retro

### [Claude] Opus 4.6

Claude produced the most thorough review (10.4 KB), tracing lifecycle paths through initialization, shutdown, and restart. Its strongest unique contribution was identifying the `exec.CommandContext` race with graceful shutdown — a subtle design concern that requires understanding Go's internal `cmd.Cancel` behavior and the interaction between manager context cancellation and `shutdownAllHostagents`. Claude also correctly identified the `ensureSSHKeysAt` timeout gap and the stale-PID comment suggestion.

However, Claude's only "important" finding (test teardown on Windows) was weaker than the other agents' findings. The concern about stale PID files in `local_teardown_file` is mitigated by the `rm -rf` that removes the entire instance directory. Claude did not identify the more impactful PID recycling issue in the production code (finding 1) or the `killHostagent` child-process issue (finding 2).

Depth was excellent: Claude traced through `store.Inspect`, `ReadPIDFile`, and the watcher goroutine lifecycle. No false positives.

### [Codex] GPT 5.4

Codex delivered a focused, high-signal review (5.9 KB) with two strong findings. Its PID recycling analysis (finding 1) identified the exact ordering problem: `stopInstanceForcibly` runs before the PID fields are zeroed, and the comment at lines 53–56 shows the author was aware of the risk for Lima's subsequent `Delete` call but missed the first invocation. The disk-unlock gap (finding 3) was also well-reasoned, noting that the `allKilled` guard protects PID cleanup but not disk locks.

Codex did not identify the `killHostagent` child-process issue. Its review was concise but thorough within its scope. No false positives.

### [Gemini] 3.1 Pro

Gemini identified the `killHostagent` child-process issue (finding 2), which neither Claude nor Codex caught. The analysis correctly traced the Windows-specific failure: `TerminateProcess` kills only the parent, `taskkill /T` on a dead PID returns 128 without traversing children, and the watcher sub-goroutine leaks.

Gemini's cross-console signaling and `KillTree` semantics findings were valid but less impactful (suggestions). The cross-console concern correctly identifies a limitation but the self-healing behavior makes it low-risk. One minor accuracy issue: Gemini stated `killHostagent` "leaves child processes orphaned and can leak the watcher goroutine" in the title, but the watcher goroutine does not leak (the main `runWatcher` goroutine returns after `events.Watch` completes; only the sub-goroutine doing `cmd.Wait` leaks). This distinction matters for understanding the blast radius.

No false positives.

### Summary

| Metric | Claude Opus 4.6 | Codex GPT 5.4 | Gemini 3.1 Pro |
|---|---|---|---|
| Duration | 11:23 | 5:31 | 5:16 |
| Critical | 0 | 0 | 0 |
| Important | 0 | 2 | 1 |
| Suggestion | 3 | 0 | 2 |
| Design observations | 2 | 2 | 1 |
| False positives | 0 | 0 | 0 |
| Unique insights | 2 | 1 | 1 |

All three agents contributed unique, non-overlapping insights. Codex delivered the highest-impact findings (PID recycling, disk unlock) with the best signal-to-noise ratio. Gemini caught the one Windows-specific process-tree issue the others missed. Claude provided the deepest architectural analysis but spent its effort on lower-severity findings. The multi-agent approach justified itself: no single agent caught all three important issues.

---

## Improvement Recommendations

### Repo context updates

1. **Windows `taskkill /T` does not traverse children of dead parents** — When reviewing process lifecycle code on Windows, verify that `KillTree` / `taskkill /F /T` is called while the parent process is still alive. If the parent has already been killed (e.g., via `Process.Kill()` or `TerminateProcess`), `taskkill /T` returns exit code 128 without killing children. This differs from Unix where `kill(-pgid, SIGKILL)` kills all process group members regardless of the leader's state.

---

## Appendix: Original Reviews

### Claude Opus 4.6

Now I have a comprehensive picture. Let me compile the review.

---

## PR #242: Run Lima BATS tests on Windows via MSYS2

| Metric | Value |
|--------|-------|
| Verdict | **Approve** |
| Critical | 0 |
| Important | 1 |
| Suggestions | 3 |

### Executive Summary

This PR fixes Windows process signaling so hostagents run in their own process group (`CREATE_NEW_PROCESS_GROUP`), adds WSL2-specific cleanup (distro termination, stale file removal) to prevent `wslservice.exe` deadlocks, pre-generates SSH keys to sidestep MSYS2 path conversion issues, and switches CI from WSL2-based to MSYS2-based BATS. The code is well-structured with thorough handling of partial-failure states and crash recovery. One important issue around test cleanup scope on Windows.

### Findings

#### Critical Issues

None.

#### Important Issues

1. **`local_teardown_file` in `limavm-running.bats` does not clean up Lima instance directory on Windows** — `bats/tests/33-lima-controllers/limavm-running.bats:22-33` (important, gap)

```bash
local_teardown_file() {
    # Clean up any remaining Lima instances
    if [[ -d "${RDD_LIMA_HOME}/${VM_NAME}" ]]; then
        rm -rf "${RDD_LIMA_HOME:?}/${VM_NAME}"
    fi
    # Clean up any WSL2 distros created by tests.
    if is_windows; then
        for vm in "${VM_NAME}" "stopped-vm"; do
            timeout 30 bash -c "MSYS_NO_PATHCONV=1 wsl.exe --unregister 'lima-${vm}'" 2>/dev/null || true
        done
    fi
}
```

The `wsl.exe --unregister` at line 31 calls into WSL before `rm -rf` at line 24 has removed the Lima instance directory. If the distro is still registered and the `--unregister` hangs (hitting the very deadlock this PR documents), the 30s timeout fires but the directory cleanup already ran. That ordering is fine. However, if the earlier "cleanup LimaVM running test" (`@test` at line 518) fails mid-way (e.g., the WSL deadlock kill at line 527 fires and the test returns `false`), subsequent `limavm-running.bats` runs may find a stale Lima instance directory with stale PID files. On Windows, `store.Inspect` does not auto-clean stale PID files (unlike Unix), so the next test run's creation path could hit `StatusRunning` for a process that no longer exists, confusing the reconciler.

Fix: In `local_teardown_file`, explicitly remove PID and socket files (`ha.pid`, `ha.sock`, `*.tmp`) from the Lima instance directory before calling `wsl.exe --unregister`, or run `stopInstanceForcibly`-equivalent cleanup (a `taskkill /F /T /PID` on any PID found in the directory):

```diff
 local_teardown_file() {
+    # On Windows, remove stale PID/socket files so the next test run's
+    # store.Inspect doesn't report a ghost process as running.
+    if is_windows && [[ -d "${RDD_LIMA_HOME}/${VM_NAME}" ]]; then
+        rm -f "${RDD_LIMA_HOME}/${VM_NAME}"/*.pid \
+              "${RDD_LIMA_HOME}/${VM_NAME}"/*.sock \
+              "${RDD_LIMA_HOME}/${VM_NAME}"/*.tmp
+    fi
     if [[ -d "${RDD_LIMA_HOME}/${VM_NAME}" ]]; then
         rm -rf "${RDD_LIMA_HOME:?}/${VM_NAME}"
     fi
```

#### Suggestions

1. **`ensureSSHKeysAt` runs `ssh-keygen` without a timeout** — `pkg/controllers/lima/limavm/controllers/ssh_keys.go:47-53` (suggestion, regression)

```go
cmd := exec.CommandContext(ctx, "ssh-keygen", "-t", "ed25519", "-q", "-N", "",
    "-C", "lima", "-f", tmpPath)
```

The `ctx` passed to `ensureSSHKeysAt` originates from `context.Background()` at `limavm_controller.go:453`. If `ssh-keygen` hangs (e.g., trying to access a hardware security module, or on a misconfigured Windows machine where it prompts for a passphrase despite `-N ""`), controller setup blocks indefinitely. A 30-second timeout on the `ssh-keygen` invocations would bound this.

Fix: Create a timeout context inside `ensureSSHKeysAt`:

```diff
+keygenCtx, cancel := context.WithTimeout(ctx, 30*time.Second)
+defer cancel()
 cmd := exec.CommandContext(keygenCtx, "ssh-keygen", "-t", "ed25519", "-q", "-N", "",
```

2. **Redundant `store.Inspect` in `shutdownHostagent` force path when `inst` is provided** — `pkg/controllers/lima/limavm/controllers/limavm_lifecycle.go:486-503` (suggestion, regression)

```go
forceStop := func() {
    forceCtx, forceCancel := context.WithTimeout(context.Background(), 30*time.Second)
    defer forceCancel()
    forceInst := inst
    if forceInst == nil {
        var err error
        forceInst, err = store.Inspect(forceCtx, name)
```

When `inst` is non-nil (the `stopInstance` call path at line 448), `forceStop` reuses the caller-provided `inst`. But `inst` was obtained from `store.Inspect` before the graceful shutdown attempt. After 30 seconds of graceful shutdown timeout, `inst.HostAgentPID` and `inst.DriverPID` may refer to already-dead processes (or worse, recycled PIDs on Windows). Calling `process.KillTree` with these stale PIDs is harmless on Unix (`ESRCH` → nil) and on Windows (exit code 128 → nil), but it's worth a comment explaining why stale PIDs are safe here.

Fix: Add a clarifying comment:

```diff
 forceStop := func() {
+    // inst's PIDs may be stale after the graceful shutdown wait.
+    // KillTree tolerates dead PIDs (ESRCH/exit 128 → nil), and PID
+    // recycling is safe because TmpFileSuffixes cleanup is skipped
+    // if any kill returns an unexpected error.
     forceCtx, forceCancel := context.WithTimeout(context.Background(), 30*time.Second)
```

3. **`setVerbosityFromEnv` mapping not documented for operators** — `pkg/external/main.go:257-278` (suggestion, regression)

```go
case "trace":
    if err := flag.Set("v", "4"); err != nil {
```

The mapping from `RDD_LOG_LEVEL` values to klog verbosity levels (`trace`→v=4, `debug`→v=2, `info`→v=0) is only visible in the source code. Operators setting `RDD_LOG_LEVEL=debug` for the service (which uses logrus) might not realize the external controllers interpret it differently (klog verbosity). Consider adding the mapping as a comment in the CI workflow where `RDD_LOG_LEVEL` is set, or in the CLAUDE.md logging section.

### Design Observations

**Concerns**

- **`exec.CommandContext` races with graceful shutdown** (future) — `startInstance` at `limavm_lifecycle.go:397` creates the hostagent command with `exec.CommandContext(ctx, ...)` where `ctx` is the reconciler context. When the manager shuts down and cancels this context, Go's internal goroutine sends `os.Kill` (SIGKILL/TerminateProcess) to the hostagent immediately, racing with `shutdownAllHostagents` which tries graceful SIGINT first. If SIGKILL arrives first, the hostagent dies without running its cleanup (VM driver termination, WSL2 distro cleanup). The `shutdownAllHostagents` code then sees `state.procExited` closed, sets `graceful = true`, and skips `stopInstanceForcibly`. The orphaned resources are cleaned up on the next service start via `killOrphanedHostagent`, so this is self-healing but leaves resources running between service restarts. Fix: override `cmd.Cancel` after `cmd.Start()` to use `process.Interrupt` instead of `os.Kill`, aligning the CommandContext behavior with the graceful shutdown design. This would be a small change in `startInstance`.

**Strengths**

- The `stopInstanceForcibly` design cleanly enumerates the four cleanup concerns (process tree, disk locks, WSL2 distro, temp files) and correctly gates file cleanup on process kill success — removing PID files for a still-running process would mask the problem from `store.Inspect`.
- The `ensureSSHKeysAt` function handles all six combinations of partial on-disk state (both exist, private-only, public-only, neither, corrupt, stale temps) with correct crash-recovery semantics: generate to `.tmp`, rename public first, then private, with cleanup on failure at each step.
- The `shutdownHostagent` consolidation replaces three inline shutdown sequences with a single function that has explicit timeouts at every stage, bounding the worst-case blocking time to ~70s rather than relying on unbounded parent context deadlines.
- The PR comment explains the `CTRL_BREAK_EVENT` vs `GenerateConsoleCtrlEvent(CTRL_BREAK, 0)` distinction clearly — the core Windows fix that motivated the PR.

### Testing Assessment

Test coverage is good. The SSH key generation has thorough unit tests covering all partial-state scenarios. The BATS integration tests cover the full VM lifecycle including crash recovery and orphaned hostagent recovery. Specific untested scenarios, ranked by risk:

1. **Force-stop fallback in `shutdownHostagent`** — acknowledged by the comment at `limavm_lifecycle.go:518-520` ("Not tested: the forceStop fallback requires a hostagent that ignores shutdown signals"). This would require a mocked hostagent that ignores SIGINT/CTRL_BREAK, which is hard to set up in integration tests.
2. **WSL2 distro termination failure in `stopInstanceForcibly`** — the `wsl.exe --terminate` timeout path is exercised only if WSL is degraded, which is hard to simulate in CI. The BATS cleanup test (line 525-530) provides partial coverage.
3. **`ensureSSHKeys` on Windows** (the platform-specific wrapper) — the unit tests cover `ensureSSHKeysAt` but not `ensureSSHKeys` which resolves the Lima config directory. This is only testable in a Windows environment with Lima installed.

### Documentation Assessment

The CLAUDE.md logging section documents the CLI/controller logging distinction. The `setVerbosityFromEnv` mapping (finding 3) would benefit from a brief note there or in the workflow comments, but this is minor.

### Acknowledged Limitations

1. **Sequential hostagent shutdown** — `limavm_controller.go:504-505`: `// TODO: Wait on all hostagents in parallel instead of sequentially; with the current loop, the total wait is N × gracefulShutdownTimeout in the worst case.` — This limitation becomes slightly more impactful with the WSL2 termination step added to the force-stop path, but N is typically 1 in practice (single VM per instance).

2. **Non-blocking stop** — `limavm_lifecycle.go:442-443` and `limavm_lifecycle.go:211-212`: `// TODO: Non-blocking stop: send SIGINT and return immediately; the watcher detects the Exiting event and triggers a reconcile.` — Unchanged by this PR. The new `shutdownHostagent` function blocks the reconciler during stop, consistent with the pre-existing design.

3. **Force-stop test gap** — `limavm_lifecycle.go:518-520`: `// Not tested: the forceStop fallback requires a hostagent that ignores shutdown signals.` — Explicitly documented.

### Codex GPT 5.4

**Verdict**

Needs changes.

### Executive Summary

This PR fixes a real Windows problem set: hostagent signaling no longer tears down the parent service, WSL2 cleanup is more defensive, and CI moves Windows BATS execution into MSYS2. The main issue I found is in the new forced-stop path: deleting a Lima VM that only *appears* stopped can now kill unrelated Windows processes via recycled stale PIDs, and the same helper unlocks additional disks even after it knows process termination failed.

### Findings

**Critical Issues**

None.

**Important Issues**

1. **Deleting a stopped WSL2 VM can kill unrelated recycled PIDs** — `pkg/controllers/lima/limavm/controllers/limavm_lifecycle.go:47-56` `(important, regression)`

```go
// Always force-stop before deleting, even if the instance appears
// stopped.
stopInstanceForcibly(ctx, logger, existingInst)
// Clear stale PIDs so Lima's Delete → StopForcibly does not kill
// unrelated processes if the PIDs were recycled (likely on Windows).
existingInst.DriverPID = 0
existingInst.HostAgentPID = 0
```

The new unconditional delete path at line 52 calls `stopInstanceForcibly()` even for instances that are already "stopped". That helper then kills `inst.DriverPID` and `inst.HostAgentPID` at lines 596-600 without re-validating that those PIDs still belong to this Lima VM. On Windows, `store.ReadPIDFile()` treats any currently running process with that PID as valid (`vendor/github.com/lima-vm/lima/v2/pkg/store/instance.go:205-212`), so a stale pidfile that has been recycled can point at an unrelated process. The mitigation at lines 53-56 only zeroes the fields *after* line 52 has already used them, so it prevents Lima's later `Delete()` call from repeating the damage but does not prevent the first kill.

Fix: split the "terminate WSL distro before unregister" behavior from the "kill by PID" behavior. For instances whose `existingInst.Status` is already stopped at line 47, skip PID-based killing entirely and do only WSL-specific teardown plus tmp-file cleanup. A minimal shape would be:

```diff
- stopInstanceForcibly(ctx, logger, existingInst)
+ if existingInst.Status == limatype.StatusRunning || existingInst.Status == limatype.StatusBroken {
+     stopInstanceForcibly(ctx, logger, existingInst)
+ } else if existingInst.VMType == limatype.WSL2 {
+     terminateWSLDistro(ctx, logger, existingInst.Name)
+     cleanupInstanceTmpFiles(logger, existingInst.Dir)
+ }
```

If you still want PID-based cleanup for stopped instances, revalidate ownership first against instance-specific state instead of trusting the pidfiles.

2. **Forced stop unlocks additional disks even after process termination failed** — `pkg/controllers/lima/limavm/controllers/limavm_lifecycle.go:598-613` `(important, gap)`

```go
if err := process.KillTree(ctx, pid); err != nil {
	logger.V(1).Info("Failed to kill process tree", "pid", pid, "error", err)
	allKilled = false
}
// Unlock additional disks so they can be reused by other instances.
for _, d := range inst.AdditionalDisks {
	...
	if err := disk.Unlock(); err != nil {
```

`stopInstanceForcibly()` records a failed kill at lines 598-600, and it correctly uses `allKilled` at lines 626-631 to avoid deleting pid/socket files when process state is uncertain. But the disk unlock loop at lines 604-613 ignores that same uncertainty and drops the lock anyway. If `KillTree()` failed because the VM driver is still alive, those disks can now be attached elsewhere while the original VM is still using them.

Fix: make disk unlocking follow the same state guard as tmp-file cleanup, or better, re-inspect and confirm the instance is stopped before unlocking. For example:

```diff
- // Unlock additional disks so they can be reused by other instances.
- for _, d := range inst.AdditionalDisks {
+ if allKilled {
+     // Unlock additional disks only after we know the instance is gone.
+     for _, d := range inst.AdditionalDisks {
        ...
- }
+     }
+ } else {
+     logger.Info("Skipping disk unlock because process kill failed")
+ }
```

**Suggestions**

None.

### Design Observations

**Strengths**

- Moving hostagents into their own process group and routing graceful shutdown through `process.Interrupt()` is the right fix for the original Windows console-wide `CTRL_BREAK_EVENT` problem.
- The WSL2-specific `wsl.exe --terminate` before unregistering is a good design choice; it addresses the actual deadlock source instead of only hardening the test harness around it.

### Testing Assessment

1. The highest-risk untested case is Windows deletion of an already-stopped/crashed WSL2 instance with stale or recycled pidfiles; that is the new path added at `pkg/controllers/lima/limavm/controllers/limavm_lifecycle.go:47-56`.
2. There is no coverage for the partial-failure branch in `stopInstanceForcibly()` where one `KillTree()` call fails but cleanup continues; that is exactly the branch that can mis-handle additional disk locks at `pkg/controllers/lima/limavm/controllers/limavm_lifecycle.go:598-613`.
3. The timeout-driven forced-stop fallback in `shutdownHostagent()` is still explicitly untested at `pkg/controllers/lima/limavm/controllers/limavm_lifecycle.go:518-520`.

### Documentation Assessment

No user-facing documentation update looks necessary. The in-code rationale for the WSL2 and process-group changes is good.

### Acknowledged Limitations

- `"Not tested: the forceStop fallback requires a hostagent that ignores shutdown signals."` — `pkg/controllers/lima/limavm/controllers/limavm_lifecycle.go:518-520`. This remains relevant because the risky cleanup branch is exactly where the new correctness bugs live.
- `"TODO: Wait on all hostagents in parallel instead of sequentially; with the current loop, the total wait is N × gracefulShutdownTimeout in the worst case."` — `pkg/controllers/lima/limavm/controllers/limavm_controller.go:504-505`. This PR does not worsen correctness here, but it does mean Windows service shutdown can still be much slower than necessary when multiple VMs are running.

### Gemini 3.1 Pro

**Executive Summary**
This PR successfully resolves a key pain point on Windows by running Lima hostagents in their own process groups, preventing them from inadvertently killing the daemon during shutdown or restart via console group signals. It also adds robust deadlock-prevention for WSL2 and improves the CI testing environment. There are a few important design gaps regarding process tree semantics and cross-console signaling on Windows that should be addressed or documented, as well as a regression in the new `killHostagent` helper that could leak the watcher goroutine if disk inspection fails.

### Findings

#### Critical Issues
None.

#### Important Issues

1. **`killHostagent` leaves child processes orphaned and can leak the watcher goroutine** — `pkg/controllers/lima/limavm/controllers/hostagent_watcher.go:217` (important, regression)

```go
func (r *LimaVMReconciler) killHostagent(name string) {
	// ...
	if !ok || state.cmd == nil || state.cmd.Process == nil {
		return
	}
	_ = state.cmd.Process.Kill()
}
```

`killHostagent` calls `state.cmd.Process.Kill()`, which only terminates the parent hostagent process. If the hostagent spawned children (e.g. SSH port forwarders) that inherit its stdout/stderr pipes, `cmd.Wait()` in the watcher goroutine will block indefinitely until those children exit.

While `shutdownHostagent` attempts to mitigate this by immediately calling `forceStop()` (which uses `process.KillTree`), `forceStop()` relies on reading PIDs from disk via `store.Inspect` if `inst` is nil. If `store.Inspect` fails (e.g., missing PID files), `KillTree` is skipped. The child processes are left running, and the watcher goroutine permanently leaks.

Fix: Since `killHostagent` already has the authoritative process group ID in `state.cmd.Process.Pid`, it should use `process.KillTree` directly to guarantee the whole process group is terminated regardless of disk state.

```go
func (r *LimaVMReconciler) killHostagent(name string) {
	// ...
	if !ok || state.cmd == nil || state.cmd.Process == nil {
		return
	}
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	_ = process.KillTree(ctx, state.cmd.Process.Pid)
}
```

2. **`process.Interrupt` fails across consoles on Windows, leading to hard-kill** — `pkg/service/cmd/service.go:345` (important, gap)

```go
	if err := process.Interrupt(pid); err != nil {
		if err := process.Kill(pid); err != nil {
			return fmt.Errorf("failed to stop %q control plane with pid %d: %w", instance.Name(), pid, err)
		}
	}
```

`StopWithWait` uses `process.Interrupt(pid)` (which on Windows calls `GenerateConsoleCtrlEvent`) to gracefully shut down the service. However, `GenerateConsoleCtrlEvent` only delivers the signal if the calling process (`rdd svc stop`) shares the same Windows console as the target process group (`rdd svc start` or daemon).

If invoked from a different terminal window or if the daemon is detached, `GenerateConsoleCtrlEvent` fails. The code then falls back to `process.Kill` (`TerminateProcess`), instantly hard-killing the daemon. Because the hostagents now run in their own process groups (via `process.SetGroup`), `TerminateProcess` on the parent does not kill them, leaving them orphaned in the background (though they will self-heal upon next restart).

Fix: Acknowledge this limitation in a comment. A true cross-console graceful shutdown on Windows will require a different IPC mechanism (e.g., a local HTTP endpoint or named pipe).

3. **`process.KillTree` semantics differ between Unix and Windows for processes with `SetGroup`** — `pkg/util/process/process_unix.go:28` (important, gap)

```go
func KillTree(_ context.Context, pid int) error {
	err := unix.Kill(-pid, unix.SIGKILL)
```

Unix `unix.Kill(-pid, ...)` only kills the process group. If the daemon starts hostagents with `SetGroup` (placing them in a new process group), `KillTree` on the daemon's PID will *not* kill the hostagents on Unix.

Conversely, on Windows, `taskkill /T` walks the Parent Process ID (PPID) tree and *will* kill all children, completely ignoring process groups. When `StopWithWait` times out and calls `KillTree` on the daemon, Unix leaves hostagents running (to be recovered as orphans on next start), while Windows force-kills them abruptly.

Fix: Document this cross-platform semantic difference on the `KillTree` function. Given the robust orphan recovery built into the controller, standardizing the behavior may not be strictly necessary, but the abstraction leak should be documented.

#### Suggestions
None.

### Design Observations

**Strengths**
- The fallback mechanism to force-stop WSL2 distros *before* unregistering them is a highly effective mitigation for a notorious `wslservice.exe` deadlock issue.
- Using MSYS2 for BATS testing on Windows provides a much cleaner environment by removing the dependency on `wsl.exe` for the test runner itself, avoiding complex path translation edge cases.

### Acknowledged Limitations

- Sequential wait during graceful shutdown:
  `// TODO: Wait on all hostagents in parallel instead of sequentially; with the current loop, the total wait is N × gracefulShutdownTimeout in the worst case.` (`pkg/controllers/lima/limavm/controllers/limavm_controller.go:505`)