# Deep Review: 20260324-062450-pr-242

| | |
|---|---|
| **Date** | 2026-03-24 06:24 |
| **Round** | 10 (of PR #242) |
| **Author** | [@jandubois](https://github.com/jandubois) |
| **PR** | [#242](https://github.com/rancher-sandbox/rancher-desktop-daemon/pull/242) — Run Lima BATS tests on Windows via MSYS2 |
| **Branch** | `msys2-ci` |
| **Commits** | `26d07c0` Address review #9 findings for PR #242<br>`911b57e` Address review #8 findings for PR #242<br>`ae2aa0e` Address deep-review findings for PR #242<br>`8b7e51c` Address PR review feedback<br>`aea03ce` Honor RDD_LOG_LEVEL in external controllers<br>`1f174c6` Run Windows BATS tests via MSYS2 instead of WSL2<br>`e3685b8` Pre-generate Lima SSH keys on Windows to avoid broken path conversion<br>`3ab20a2` Enable Lima BATS tests on Windows<br>`16c73c0` Fix stopInstanceForcibly to terminate WSL2 distro and clean up files<br>`8c3ed7b` Fix Windows process signaling for Lima hostagent lifecycle |
| **Reviewers** | Claude Opus 4.6, Codex GPT 5.4, Gemini 3.1 Pro |
| **Verdict** | **Merge with fixes** — one unbounded wait blocks service shutdown; PID zeroing silently disables Lima's kill retry |
| **Wall-clock time** | `21 min 28 s` |

---

## Executive Summary

This PR replaces WSL2-based BATS testing with MSYS2 on Windows to eliminate flaky `binfmt_misc` interop failures, fixes Windows process signaling so hostagent lifecycle operations no longer kill the RDD service, and adds WSL2 distro termination to prevent `wslservice.exe` deadlocks during cleanup. The primary findings are an unbounded process-exit wait that can hang service shutdown and a gap in the deletion path where PIDs are unconditionally zeroed regardless of whether the preceding force-stop succeeded.

---

## Critical Issues

None.

---

## Important Issues

1. **`shutdownAllHostagents` blocks indefinitely after forced kill** — `limavm_controller.go:525` (important, regression) [Claude Opus 4.6]

```go
if !graceful {
    cleanupCtx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
    inst, err := store.Inspect(cleanupCtx, name)
    if err == nil && inst != nil {
        stopInstanceForcibly(cleanupCtx, logger, inst)
    }
    cancel()
    <-state.procExited  // blocks without timeout
}
```

After `stopInstanceForcibly` kills the process tree, the code waits on `<-state.procExited` with no timeout. This channel closes when `cmd.Wait()` returns, but `cmd.Wait()` blocks until all pipe readers close. On Windows, if a child process survives `KillTree` (e.g., because `taskkill /T` cannot traverse from a dead parent — documented at `process_windows.go:70-76`) and holds an inherited pipe, `cmd.Wait()` never returns and service shutdown hangs. The sister function `shutdownHostagent` (line 533-536) correctly uses a 10-second bounded wait via `waitAfterKill`. The pre-existing code fell through to `state.cancel()` immediately without waiting; this change adds the blocking point without a safety bound.

Fix: Add a timeout around the post-kill wait, matching `shutdownHostagent`:

```diff
         stopInstanceForcibly(cleanupCtx, logger, inst)
     }
     cancel()
-    <-state.procExited
+    waitCtx, waitCancel := context.WithTimeout(context.Background(), 10*time.Second)
+    select {
+    case <-state.procExited:
+    case <-waitCtx.Done():
+        logger.Info("Timed out waiting for hostagent to exit after forced kill", "instance", name)
+    }
+    waitCancel()
```

2. **`handleDeletion` unconditionally zeroes PIDs regardless of force-stop outcome** — `limavm_lifecycle.go:61-68` (important, regression) [Codex GPT 5.4]

```go
if runtime.GOOS == "windows" {
    existingInst.DriverPID = 0
    existingInst.HostAgentPID = 0
}
```

`stopInstanceForcibly()` records kill failures internally (`allKilled = false` at line 624) but does not propagate the result to its caller. `handleDeletion()` then unconditionally zeroes `DriverPID` and `HostAgentPID` at lines 61-68 before calling `limainstance.Delete()` at line 74. Lima's `Delete()` calls `StopForcibly(inst)` internally as a last-resort kill. Zeroing the PIDs disables that retry on the exact path where it matters most — when the first kill failed and the process is still alive.

The intent (prevent killing recycled PIDs) is sound and well-documented in the comment. The gap is that when `stopInstanceForcibly` just ran and failed, the PIDs are still valid — they have not been recycled because the process is still alive.

Fix: Return a success indicator from `stopInstanceForcibly` and only zero PIDs when the kill succeeded (or when the process was already dead):

```diff
-func stopInstanceForcibly(ctx context.Context, logger logr.Logger, inst *limatype.Instance) {
+func stopInstanceForcibly(ctx context.Context, logger logr.Logger, inst *limatype.Instance) bool {
     allKilled := true
     ...
-}
+    return allKilled
+}
```

Then in `handleDeletion`:

```diff
-    if existingInst.Status == limatype.StatusRunning {
-        stopInstanceForcibly(ctx, logger, existingInst)
+    allStopped := true
+    if existingInst.Status == limatype.StatusRunning {
+        allStopped = stopInstanceForcibly(ctx, logger, existingInst)
     }
     if runtime.GOOS == "windows" {
-        existingInst.DriverPID = 0
-        existingInst.HostAgentPID = 0
+        if allStopped {
+            existingInst.DriverPID = 0
+            existingInst.HostAgentPID = 0
+        }
     }
```

---

## Suggestions

1. **Unix `process.Kill` sends SIGTERM despite "force-kill" comment** — `service.go:367-375`, `process_unix.go:31-33` (suggestion, enhancement) [Codex GPT 5.4]

```go
case <-timeout:
    // Graceful shutdown timed out; force-kill so we don't leave
    // a hung service process.
    _ = process.Kill(pid)
```

The timeout branch comments say "force-kill," but `process.Kill` on Unix sends SIGTERM (line 32-33 of `process_unix.go`), not SIGKILL. Since `StopWithWait` already sent SIGINT at line 352, a process wedged in shutdown receives no stronger signal after the 60-second timeout. The process can remain running despite the command reporting failure.

Fix: Use `process.KillTree` or `unix.Kill(pid, unix.SIGKILL)` in the timeout branch to match the "force-kill" semantics the comment describes.

2. **`vm_template` and `vm_template_running` have identical bodies** — `bats/helpers/vm_template.bash:14-28` (suggestion, regression) [Claude Opus 4.6]

```bash
vm_template() {
    if is_windows; then
        _wsl2_template
    else
        _unix_template
    fi
}

vm_template_running() {
    if is_windows; then
        _wsl2_template
    else
        _unix_template
    fi
}
```

Both functions dispatch identically. The file header says `vm_template_running` "supports RDD_VM_TYPE on Unix," but so does `vm_template` — both call `_unix_template` which includes the `${RDD_VM_TYPE:+vmType: ${RDD_VM_TYPE}}` expansion.

Fix: Merge into one function, or add a comment explaining the planned divergence to prevent premature consolidation.

3. **`assert_process_alive` name misleads in predicate contexts** — `bats/helpers/os.bash:100` (suggestion, regression) [Claude Opus 4.6]

```bash
assert_process_alive() {
    local pid=$1
    if is_windows; then
        MSYS_NO_PATHCONV=1 tasklist.exe /FI "PID eq ${pid}" /NH 2>/dev/null | grep -qw "${pid}"
    else
        kill -0 "${pid}" 2>/dev/null
    fi
}
```

The `assert_` prefix implies a test-failing assertion, but the function is also used as a predicate in `try --until-fail -- assert_process_alive` (limavm-running.bats:284). BATS treats both patterns the same, so this is cosmetic. A name like `is_process_alive` would better match the predicate usage.

4. **Silent error handling for locked SSH key files** — `ssh_keys.go:40-44` (suggestion, gap) [Gemini 3.1 Pro]

```go
_ = os.Remove(privPath)
// Remove orphaned public key ...
_ = os.Remove(pubPath)
```

If a key file exists but cannot be removed (file lock, restricted permissions), `os.Remove` fails silently. The subsequent `os.Rename` then fails with a confusing "failed to rename" error on Windows (where rename fails if the destination exists).

Fix: Check the `os.Remove` error. If it is not `os.ErrNotExist`, return it with context ("failed to remove existing key") to surface the root cause.

---

## Design Observations

### Concerns

- **(future)** **PID-based process management on Windows is structurally fragile** [Claude Opus 4.6, Codex GPT 5.4, Gemini 3.1 Pro] — The lifecycle controller holds a direct `os.Process` handle for the hostagent in `instanceStates`, yet cleanup routines reload raw PIDs from disk via `store.Inspect`. Passing the held handle to cleanup functions would eliminate PID recycling risks for the hostagent entirely. Windows Job Objects (noted in the TODO at `process_windows.go:75`) would extend reliable tracking to all descendants (driver, SSH forwarders), removing the `taskkill /T` dead-parent traversal limitation and the `StatusBroken` PID-zeroing workaround.

- **(future)** **`process.Interrupt` fails silently for Windows services** [Claude Opus 4.6] — `StopWithWait` at `service.go:346-351` tries `process.Interrupt` (which uses `GenerateConsoleCtrlEvent`), but this fails when the CLI and service lack a shared console. The fallback to `process.Kill` (`TerminateProcess`) bypasses graceful shutdown entirely, so `rdd svc stop` never achieves graceful hostagent shutdown on Windows. Job Objects or named events would fix this.

### Strengths

- **WSL2 to MSYS2 transition** [Codex GPT 5.4, Gemini 3.1 Pro] — Replacing the WSL2-hosted BATS runner with MSYS2 removes the flaky `binfmt_misc` interop layer at its root instead of working around it.
- **`stopInstanceForcibly` thoroughness** [Claude Opus 4.6] — Kills the process tree, terminates the WSL2 distro, unlocks disks only after confirming kills succeeded, and cleans up tmp files only when safe. The conditional cleanup based on `allKilled` guards against masking still-running processes.
- **`shutdownHostagent` consolidation** [Claude Opus 4.6] — Unifies the graceful-then-forced shutdown pattern (previously duplicated across `handleRestartNeeded`, `stopInstance`, and `killOrphanedHostagent`) into a single function with clear fallback semantics.
- **SSH key crash recovery** [Claude Opus 4.6] — `ensureSSHKeysAt` handles all crash-recovery states: missing public key, corrupt private key, stale temp files, and orphaned public key. The Windows `os.Rename` limitation is handled by removing targets before renaming.

---

## Testing Assessment

Test coverage is good. The BATS tests exercise process lifecycle, cross-platform process management, WSL2 cleanup with deadlock recovery, and SSH key generation. Untested scenarios ranked by risk:

1. **`shutdownAllHostagents` post-kill hang** — The `<-state.procExited` unbounded wait requires a hostagent whose children survive `KillTree` and hold pipes. No test exercises this path.
2. **Deletion after failed force-stop** — No test forces `stopInstanceForcibly` to fail during deletion and verifies the deletion aborts or retries rather than proceeding with zeroed PIDs.
3. **`StopWithWait` on Windows without shared console** — The graceful-shutdown-bypass path is untested; only the orphan recovery path is exercised.
4. **Stale PID files on Windows** — Acknowledged at lines 48-52 of `limavm_lifecycle.go` as requiring Windows-specific PID manipulation that BATS cannot reproduce.

---

## Documentation Assessment

No gaps. Inline comments explain Windows-specific behavior, platform asymmetries, and acknowledged limitations. The `vm_template.bash` header documents template selection.

---

## Acknowledged Limitations

1. **Windows Job Objects** — `process_windows.go:75`: The comment and PR discussion acknowledge that Job Objects would be a cleaner solution for process group management. The current `CREATE_NEW_PROCESS_GROUP` + `KillTree` approach works but has the `taskkill /T` dead-parent traversal limitation.

2. **PID recycling on Windows** — `limavm_lifecycle.go:210-218`: Documents that `stopInstanceForcibly` kills by PID without verifying process identity. The deletion path guards against this by zeroing PIDs, but the restart path does not. "The self-healing restart that follows limits the blast radius." The proper fix is process identity validation or Job Objects.

3. **Sequential hostagent shutdown** — `limavm_controller.go:504-505`: `TODO: Wait on all hostagents in parallel instead of sequentially.` Total shutdown time is N × gracefulShutdownTimeout in the worst case.

4. **Cross-console interrupt failure** — `service.go:346-351`: `GenerateConsoleCtrlEvent` fails without a shared console, falling back to `TerminateProcess` which bypasses graceful shutdown.

5. **`shutdownHostagent` force-stop isolation** — `limavm_lifecycle.go:542-544`: The force-stop fallback requires a hostagent that ignores shutdown signals; the orphaned-hostagent test exercises forced stop indirectly but does not isolate this timeout path.

---

## Unresolved Feedback

- **MSYS2/cygwin SSH dependency**: https://github.com/rancher-sandbox/rancher-desktop-daemon/pull/242#discussion_r2976872168 — @mook-as asked whether MSYS2 SSH must always ship with RDD. @jandubois replied that `wsl.exe` will replace `ssh.exe` for shell access and the dependency is hardcoded in Lima upstream. No code change made; tracks a future dependency to eliminate.

- **`pacman.exe` path suggestion**: https://github.com/rancher-sandbox/rancher-desktop-daemon/pull/242#discussion_r2976767538 — @mook-as suggested using the full `C:\msys64\usr\bin\pacman.exe` path in the workflow. No corresponding code change visible in subsequent commits.

---

## Agent Performance Retro

### [Claude]

Claude Opus 4.6 delivered the highest signal-to-noise ratio. Its sole important finding (`shutdownAllHostagents` unbounded wait) was unique, well-evidenced with a traced code path, and included a diff-format fix matching the repo's existing `waitAfterKill` pattern. It accurately identified the regression: pre-existing code fell through to `state.cancel()` immediately, while the new code adds a blocking point without a safety bound. The suggestions were minor but valid. Design observations were thorough and balanced, with four specific strengths and a well-scoped concern. No false positives.

### [Codex]

Codex GPT 5.4 produced the most architecturally interesting finding: the PID-zeroing gap in `handleDeletion`. It correctly traced through Lima's `Delete → StopForcibly` internal call chain to show that zeroing PIDs disables the retry. The Unix SIGTERM-vs-SIGKILL observation was valid though low-impact. Design observations were focused. No false positives.

### [Gemini]

Gemini 3.1 Pro flagged PID recycling as CRITICAL, but the code has extensive comments acknowledging this risk at lines 210-218 of `limavm_lifecycle.go`. Per review conventions, acknowledged limitations should not be reported as findings. The claim that Lima previously used `GenerateConsoleCtrlEvent` as a safety net against killing GUI apps overstates the behavior — `GenerateConsoleCtrlEvent` fails against non-console processes, but that is a limitation, not a safety mechanism. Severity inflated.

Gemini's IMPORTANT finding about orphaned port forwarders blocking restarts contradicts the code's own analysis at `process_windows.go:70-76` without providing evidence that Lima reuses the same SSH ports across hostagent restarts. Lima allocates dynamic ports, so orphaned forwarders holding old ports do not prevent new hostagents from binding fresh ones. The SSH key error handling suggestion was valid but low-impact.

### Summary

| Metric | Claude Opus 4.6 | Codex GPT 5.4 | Gemini 3.1 Pro |
|---|---|---|---|
| Duration | 6:05 | 7:06 | 5:21 |
| Critical | 0 | 0 | 1 (downgraded) |
| Important | 1 | 1 | 1 (not consolidated) |
| Suggestion | 2 | 1 | 1 |
| Design observations | 5 | 3 | 3 |
| False positives | 0 | 0 | 2 |
| Unique insights | 1 | 1 | 0 |
| Files reviewed | 21 | 21 | 21 |
| Coverage misses | 0 | 0 | 0 |

Claude and Codex each contributed one unique, verified finding that the other agents missed. Gemini's findings either restated acknowledged limitations at inflated severity or contradicted the code's own analysis without supporting evidence. Claude provided the most comprehensive design observations.

---

## Appendix: Original Reviews

### Claude Opus 4.6

Now I have a thorough understanding of the full changeset. Let me write the review.

---

## Executive Summary

This PR replaces WSL2-based BATS testing with MSYS2 on Windows (avoiding flaky `binfmt_misc` interop failures), fixes Windows process signaling so hostagent lifecycle operations no longer kill the RDD service, and adds WSL2 distro termination to prevent `wslservice.exe` deadlocks during cleanup. The primary finding is a missing timeout in `shutdownAllHostagents` that could hang service shutdown.

## Findings

### Critical Issues

None.

### Important Issues

1. **`shutdownAllHostagents` blocks indefinitely after forced kill** — `limavm_controller.go:525` (important, regression)

```go
if !graceful {
    cleanupCtx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
    inst, err := store.Inspect(cleanupCtx, name)
    if err == nil && inst != nil {
        stopInstanceForcibly(cleanupCtx, logger, inst)
    }
    cancel()
    <-state.procExited  // blocks without timeout
}
```

After `stopInstanceForcibly`, the code waits on `<-state.procExited` with no timeout. This channel closes when `cmd.Wait()` returns, but `cmd.Wait()` blocks until all pipe reads complete. On Windows, if a child process (e.g., `wsl.exe` subprocess) survives `KillTree` and holds an inherited pipe, `cmd.Wait()` never returns and service shutdown hangs forever.

The sister function `shutdownHostagent` (line 533-536) correctly uses a 10-second bounded wait via `waitAfterKill`:

```go
waitAfterKill := func() {
    killCtx, killCancel := context.WithTimeout(context.Background(), 10*time.Second)
    defer killCancel()
    r.waitForProcessExit(killCtx, name)
}
```

The pre-existing code did not wait after `Process.Kill()` — it fell through to `state.cancel()` immediately. This change adds a new blocking point without a safety bound.

Fix: Add a timeout around the post-kill wait, matching `shutdownHostagent`:

```diff
 			stopInstanceForcibly(cleanupCtx, logger, inst)
 		}
 		cancel()
-		<-state.procExited
+		waitCtx, waitCancel := context.WithTimeout(context.Background(), 10*time.Second)
+		select {
+		case <-state.procExited:
+		case <-waitCtx.Done():
+			logger.Info("Timed out waiting for hostagent to exit after forced kill", "instance", name)
+		}
+		waitCancel()
```

### Suggestions

1. **`vm_template` and `vm_template_running` are identical** — `bats/helpers/vm_template.bash:14-28` (suggestion, regression)

```bash
vm_template() {
    if is_windows; then
        _wsl2_template
    else
        _unix_template
    fi
}

vm_template_running() {
    if is_windows; then
        _wsl2_template
    else
        _unix_template
    fi
}
```

These two functions have the same body. The file header says `vm_template_running` "supports RDD_VM_TYPE on Unix," but so does `vm_template` — both call `_unix_template` which includes the `${RDD_VM_TYPE:+vmType: ${RDD_VM_TYPE}}` expansion. If they're meant to diverge later, a comment explaining why would prevent a future "cleanup" from merging them prematurely.

Fix: Either merge into one function (callers use the same result) or add a comment explaining the intended future divergence.

2. **`assert_process_alive` name is misleading in assertion context** — `bats/helpers/os.bash:100` (suggestion, regression)

```bash
assert_process_alive() {
    local pid=$1
    if is_windows; then
        MSYS_NO_PATHCONV=1 tasklist.exe /FI "PID eq ${pid}" /NH 2>/dev/null | grep -qw "${pid}"
    else
        kill -0 "${pid}" 2>/dev/null
    fi
}
```

The function is named `assert_process_alive` (implying it fails the test on false), but it's used in both assertion contexts (`assert_process_alive "${OLD_HA_PID}"` at limavm-running.bats:235) and non-assertion contexts (`try --max 10 --delay 1 --until-fail -- assert_process_alive "${svc_pid}"` at line 284). In the `try --until-fail` usage, the function acts as a predicate, not an assertion. A name like `is_process_alive` would be more accurate for the predicate usage. Not a bug — BATS treats both patterns the same — but the naming implies test-failing semantics that don't always apply.

## Design Observations

**Strengths**

- The `stopInstanceForcibly` function is thorough: it kills the process tree, terminates the WSL2 distro, unlocks disks only after confirming kills succeeded, and cleans up tmp files only when safe. The conditional cleanup based on `allKilled` is a good guard against masking still-running processes.

- The `shutdownHostagent` refactoring consolidates the graceful→forced shutdown pattern (previously duplicated across `handleRestartNeeded`, `stopInstance`, and `killOrphanedHostagent`) into a single function with clear fallback semantics.

- The `ensureSSHKeysAt` implementation carefully handles all crash-recovery states: missing public key (derives from private), corrupt private key (full regeneration), stale temp files (cleaned before keygen), and orphaned public key (removed before generation). The Windows `os.Rename` limitation is correctly handled by removing targets before renaming.

- The `handleDeletion` PID-zeroing strategy (lines 61-68) is a clever workaround for Lima's ReadPIDFile behavior on Windows, preventing PID recycling from causing Lima's internal `StopForcibly` to kill unrelated processes.

**Concerns**

- **(future)** The `process.Interrupt` on Windows uses `GenerateConsoleCtrlEvent(CTRL_BREAK_EVENT, pid)` which requires the target to have been created with `CREATE_NEW_PROCESS_GROUP`. When called from `StopWithWait` on the RDD service itself (line 352 of service.go), the service was NOT necessarily started with `CREATE_NEW_PROCESS_GROUP`. The comment at lines 346-351 acknowledges this: if `Interrupt` fails (no shared console), `Kill` (TerminateProcess) is used, bypassing graceful shutdown. Hostagents self-heal via `killOrphanedHostagent` on next start. This is acceptable now but means `rdd svc stop` never gets graceful hostagent shutdown on Windows when the CLI lacks a shared console with the service. Windows Job Objects (already noted in a TODO at `process_windows.go:75`) would fix this.

## Testing Assessment

Test coverage is good. The BATS tests exercise:
- Process lifecycle: kill, crash recovery, orphan detection (limavm-running.bats)
- Cross-platform process management via `assert_process_alive`/`process_kill` helpers
- WSL2 cleanup with deadlock recovery (wslservice.exe kill fallback in cleanup test)
- SSH key generation via comprehensive unit tests (ssh_keys_test.go)

Untested scenarios (ranked by risk):
1. `shutdownAllHostagents` forced-kill timeout behavior — the `<-state.procExited` hang scenario is not exercised because it requires a hostagent whose children survive KillTree. The test comment at line 542-544 of limavm_lifecycle.go acknowledges this gap.
2. `StopWithWait` on Windows where `Interrupt` fails and `Kill` bypasses graceful shutdown — the orphan recovery path is tested, but the initial graceful-shutdown-bypass is not.
3. `stopInstanceForcibly` on StatusBroken WSL2 instances with stale PID files — acknowledged untested at lines 48-52 of limavm_lifecycle.go.

## Documentation Assessment

No documentation gaps. The PR has extensive inline comments explaining Windows-specific behavior, platform asymmetries, and acknowledged limitations. The `vm_template.bash` header documents the template selection mechanism.

## Acknowledged Limitations

1. **Windows Job Objects** — `process_windows.go:75`: "This sounds like something that jobs (Windows Job Objects) might be relevant for." The comment and PR discussion acknowledge that Job Objects would be a cleaner solution for process group management. The current `CREATE_NEW_PROCESS_GROUP` + `KillTree` approach works but has the documented `taskkill /T` dead-parent traversal limitation.

2. **PID recycling on Windows** — `limavm_lifecycle.go:210-218`: Documents that `stopInstanceForcibly` kills by PID without verifying process identity. The deletion path guards against this for StatusBroken instances, but the crash-recovery path does not. "The self-healing restart that follows limits the blast radius."

3. **Sequential hostagent shutdown** — `limavm_controller.go:504-505`: `TODO: Wait on all hostagents in parallel instead of sequentially.` This PR does not change the sequential nature, which could multiply shutdown time by the number of instances.

4. **StopWithWait on Windows** — `service.go:346-351`: Acknowledges that `GenerateConsoleCtrlEvent` fails without a shared console, falling back to `TerminateProcess` which bypasses graceful shutdown. This change makes the limitation slightly more visible by trying graceful first.

## Unresolved Feedback

- **MSYS2/cygwin ssh dependency**: https://github.com/rancher-sandbox/rancher-desktop-daemon/pull/242#discussion_r2976872168 — @mook-as asked whether MSYS2 ssh must always ship with RDD. @jandubois replied that `wsl.exe` will be used instead of `ssh.exe` for shell access, and the current dependency is hardcoded in Lima. No code change was made or needed, but this tracks a future dependency to eliminate.

## Coverage Summary

| File | Status |
|------|--------|
| `.github/actions/spelling/expect/tech.txt` | Trivial (spelling list additions) |
| `.github/actions/spelling/expect/tools.txt` | Trivial (spelling list addition) |
| `.github/workflows/bats.yaml` | Reviewed, no issues |
| `bats/helpers/load.bash` | Trivial (one source line added) |
| `bats/helpers/os.bash` | **Suggestion 2** |
| `bats/helpers/utils.bash` | Reviewed, no issues |
| `bats/helpers/vm_template.bash` | **Suggestion 1** |
| `bats/tests/33-lima-controllers/limavm-instance.bats` | Reviewed, no issues |
| `bats/tests/33-lima-controllers/limavm-running.bats` | Reviewed, no issues |
| `pkg/controllers/lima/limavm/controllers/hostagent_watcher.go` | Reviewed, no issues |
| `pkg/controllers/lima/limavm/controllers/limavm_controller.go` | **Finding 1** |
| `pkg/controllers/lima/limavm/controllers/limavm_lifecycle.go` | Reviewed, no issues |
| `pkg/controllers/lima/limavm/controllers/ssh_keys.go` | Reviewed, no issues |
| `pkg/controllers/lima/limavm/controllers/ssh_keys_test.go` | Reviewed, no issues |
| `pkg/controllers/lima/limavm/controllers/ssh_keys_unix.go` | Reviewed, no issues |
| `pkg/controllers/lima/limavm/controllers/ssh_keys_windows.go` | Reviewed, no issues |
| `pkg/external/main.go` | Reviewed, no issues |
| `pkg/service/cmd/service.go` | Reviewed, no issues |
| `pkg/util/process/process_unix.go` | Reviewed, no issues |
| `pkg/util/process/process_windows.go` | Reviewed, no issues |
| `scripts/collect-bats-logs.sh` | Reviewed, no issues |

### Codex GPT 5.4

Verdict: Needs changes

### Executive Summary

This PR substantially improves Windows Lima coverage by moving BATS to MSYS2, isolating hostagents into their own process groups, and hardening WSL2 cleanup paths. The main regression I found is in the new Windows delete flow: it clears Lima’s stored PIDs before calling `Delete()` even when the preceding force-stop failed, which removes the last remaining kill attempt and can leave a live VM process behind during deletion.

### Findings

#### Critical Issues

None.

#### Important Issues

1. **Deletion clears PIDs even after a failed force-stop** — `pkg/controllers/lima/limavm/controllers/limavm_lifecycle.go:61-74` (important, regression)

```go
if runtime.GOOS == "windows" {
    existingInst.DriverPID = 0
    existingInst.HostAgentPID = 0
}
deleteCtx, deleteCancel := context.WithTimeout(ctx, 60*time.Second)
err = limainstance.Delete(deleteCtx, existingInst, true)
```

`stopInstanceForcibly()` only records kill failures by setting `allKilled = false` at lines 622-624, but it does not return that failure to `handleDeletion()`. `handleDeletion()` then unconditionally zeroes `existingInst.DriverPID` and `existingInst.HostAgentPID` at lines 61-68 before calling `limainstance.Delete()` at line 74. That matters because Lima’s `Delete()` immediately calls `StopForcibly(inst)` again before unregistering and removing files (`github.com/lima-vm/lima/v2/pkg/instance/delete.go:22`), so zeroing the PIDs here disables the only remaining kill attempt exactly when the first one failed.

On the failure path this leaves two bad states: a live driver/hostagent can survive while `Delete()` proceeds into unregister/remove, and the PR’s WSL deadlock workaround can regress back into hangs because the still-running process is now invisible to Lima’s cleanup. This behavior was introduced by lines 61-68 in commit `911b57e`; `git blame` confirms those lines are new in this PR.

Fix: make `stopInstanceForcibly()` return whether each PID was actually terminated, and only clear a PID after a fresh check confirms that specific process is gone. If the force-stop failed, abort deletion and requeue instead of calling `Delete()` with both PIDs zeroed.

#### Suggestions

1. **The new timeout “force-kill” path is not actually forceful on Unix** — `pkg/service/cmd/service.go:367-375` (suggestion, enhancement)

```go
case <-timeout:
    // Graceful shutdown timed out; force-kill so we don't leave
    // a hung service process.
    _ = process.Kill(pid)
    return fmt.Errorf("timed out waiting for %q control plane with pid %d to stop", instance.Name(), pid)
```

The timeout branch at lines 367-375 says it will “force-kill” the service, but on Unix `process.Kill()` is still just `SIGTERM` (`pkg/util/process/process_unix.go:31-33`). Because `StopWithWait()` already sent `SIGINT` at line 352, a process that is wedged in shutdown gets no stronger signal after the 60-second timeout and can remain running even though the command reports failure. This does not break the Windows fix, but it makes the new recovery path weaker than the comment suggests.

Fix: keep the initial graceful `Interrupt()`, but switch the timeout branch to a real forceful path on Unix, e.g. `KillTree(context.Background(), pid)` or `unix.Kill(pid, unix.SIGKILL)`.

### Design Observations

**Concerns**

1. `(future)` The Windows lifecycle still depends on PID files plus `taskkill`, which is exactly the combination that makes stale/recycled PID handling so fragile. Moving Lima-launched processes into a Windows Job Object, or retaining verified process handles instead of reloading raw PIDs from disk, would eliminate a whole class of delete/restart edge cases: dead parent with live children, recycled PID false positives, and the need to special-case `StatusBroken`.

**Strengths**

1. `(in-scope)` Replacing WSL-hosted BATS with MSYS2 is the right abstraction boundary. It removes the flaky `binfmt_misc` interop layer instead of trying to paper over it.
2. `(in-scope)` The new outer timeout in `limavm-running.bats` is a good defensive test change: it prevents one stuck `wsl.exe`/`wslservice.exe` path from wedging the whole suite.

### Testing Assessment

1. No test forces `process.KillTree()` to fail during Windows deletion and verifies that deletion aborts or requeues instead of zeroing the PIDs and calling `Delete()` anyway.
2. The stale/recycled PID cases explicitly called out in `pkg/controllers/lima/limavm/controllers/limavm_lifecycle.go:210-218` are still untested, so the Windows safety argument remains comment-backed rather than exercised.
3. `pkg/external/main.go` adds `setVerbosityFromEnv()`, but there is no focused unit test for env/flag precedence or invalid `RDD_LOG_LEVEL` handling.

### Documentation Assessment

No user-facing documentation update looks necessary. The inline comment at `pkg/service/cmd/service.go:368-374` should be tightened if the timeout path remains `SIGTERM` on Unix, because it currently describes a forceful kill that the code does not perform.

### Acknowledged Limitations

1. `pkg/controllers/lima/limavm/controllers/limavm_controller.go:504-505`  
   “TODO: Wait on all hostagents in parallel instead of sequentially; with the current loop, the total wait is N × gracefulShutdownTimeout in the worst case.”  
   This PR makes that more noticeable because Windows shutdown is now intentionally more graceful.

2. `pkg/controllers/lima/limavm/controllers/limavm_lifecycle.go:210-218`  
   The code explicitly notes that `StatusBroken` on Windows can carry stale recycled PIDs and that the proper fix is process identity validation or Job Objects. This PR reduces the risk on deletion, but the restart/self-heal path still relies on the weaker PID-based approach.

3. `bats/tests/33-lima-controllers/limavm-running.bats:262-264`  
   “Not tested: whether the orphaned hostagent shuts down gracefully … or is force-killed.”  
   The new tests confirm recovery, not which shutdown mechanism actually ran.

4. `pkg/controllers/lima/limavm/controllers/limavm_lifecycle.go:542-544`  
   The code explicitly notes that the `shutdownHostagent()` force-stop fallback is not isolated by tests.

### Coverage Summary

- `.github/actions/spelling/expect/tech.txt` — Trivial
- `.github/actions/spelling/expect/tools.txt` — Trivial
- `.github/workflows/bats.yaml` — Reviewed, no issues
- `bats/helpers/load.bash` — Trivial
- `bats/helpers/os.bash` — Reviewed, no issues
- `bats/helpers/utils.bash` — Reviewed, no issues
- `bats/helpers/vm_template.bash` — Reviewed, no issues
- `bats/tests/33-lima-controllers/limavm-instance.bats` — Reviewed, no issues
- `bats/tests/33-lima-controllers/limavm-running.bats` — Reviewed, no issues
- `pkg/controllers/lima/limavm/controllers/hostagent_watcher.go` — Reviewed, no issues
- `pkg/controllers/lima/limavm/controllers/limavm_controller.go` — Reviewed, no issues
- `pkg/controllers/lima/limavm/controllers/limavm_lifecycle.go` — Finding 1
- `pkg/controllers/lima/limavm/controllers/ssh_keys.go` — Reviewed, no issues
- `pkg/controllers/lima/limavm/controllers/ssh_keys_test.go` — Reviewed, no issues
- `pkg/controllers/lima/limavm/controllers/ssh_keys_unix.go` — Reviewed, no issues
- `pkg/controllers/lima/limavm/controllers/ssh_keys_windows.go` — Reviewed, no issues
- `pkg/external/main.go` — Reviewed, no issues
- `pkg/service/cmd/service.go` — Finding 2
- `pkg/util/process/process_unix.go` — Finding 2
- `pkg/util/process/process_windows.go` — Reviewed, no issues
- `scripts/collect-bats-logs.sh` — Reviewed, no issues

### Gemini 3.1 Pro

### Executive Summary

This PR transitions the Windows BATS tests from WSL2 to MSYS2 to address interop instability, alongside fixing multiple Windows-specific process lifecycle and WSL2 deadlock issues in the Lima hostagent. While the architectural direction effectively isolates hostagents and recovers from hung distros, the switch to `taskkill /F /T /PID` introduces a critical safety regression by blindly killing recycled PIDs, which risks terminating unrelated user processes.

### Findings

#### Critical Issues

1. **Blind process tree termination via `taskkill` risks killing unrelated user processes** — `pkg/controllers/lima/limavm/controllers/limavm_lifecycle.go:516-523`

```go
func stopInstanceForcibly(ctx context.Context, logger logr.Logger, inst *limatype.Instance) {
	allKilled := true
	for _, pid := range []int{inst.DriverPID, inst.HostAgentPID} {
		if pid > 0 {
			if err := process.KillTree(ctx, pid); err != nil {
```

The PR replaces Lima's `limainstance.StopForcibly` with `process.KillTree`, which on Windows uses `taskkill /F /T /PID`. Because Windows recycles PIDs aggressively, the `DriverPID` or `HostAgentPID` read from disk by `store.Inspect` may point to completely unrelated user applications. The author acknowledges this in a comment (lines 206-216), claiming "the self-healing restart that follows limits the blast radius." However, a VM restart does not mitigate the collateral damage of murdering a user's web browser, IDE, or system service. Previously, Lima's `StopForcibly` used `GenerateConsoleCtrlEvent`, which implicitly failed against GUI apps and non-console processes. Stripping this safety net and using a blind `taskkill /F /T` is a critical regression in user safety.

Fix: Do not use `taskkill /F /T` on PIDs read from disk without verifying the process identity. For `DriverPID`, verify the executable name (e.g., using `GetProcessImageFileName` via Go syscalls or WMI) before terminating, or at least restrict `taskkill` with an image name filter (`taskkill /F /T /PID <pid> /FI "IMAGENAME eq qemu-system-x86_64.exe"`). For `HostAgentPID`, use the `state.cmd.Process` handle instead of the PID from disk when available, as holding the handle prevents PID recycling.

#### Important Issues

2. **Orphaned child processes block self-healing restarts on Windows** — `pkg/util/process/process_windows.go:70-76`

```go
// Platform asymmetry: if the target process is already dead, taskkill /T
// returns exit code 128 (treated as success), but surviving children (e.g.,
// SSH port forwarders) are not killed because taskkill cannot traverse the
// tree from a dead parent. On Unix, kill(-pgid) still reaches all group
// members. This is acceptable: orphaned port forwarders cannot rebind their
// ports and are harmless.
```

The assumption that "orphaned port forwarders ... are harmless" is incorrect. If an orphaned SSH port forwarder is already bound to a port (e.g., a mapped user port), it holds that port open. When the controller attempts a self-healing restart, the new hostagent will fail to start because its new SSH port forwarder will receive an "Address already in use" error. The failure to kill surviving children directly breaks the self-healing mechanism the PR relies on.

Fix: Implement Windows Job Objects to track and terminate child processes reliably, or explicitly search for and terminate orphaned `ssh.exe` processes associated with the instance during forceful stops.

#### Suggestions

3. **Improve error handling for locked SSH keys** — `pkg/controllers/lima/limavm/controllers/ssh_keys.go:40-44`

```go
		_ = os.Remove(privPath)
	}
	// Remove orphaned public key (e.g. from a previous partial rename
	// failure). On Windows, os.Rename fails if the destination exists.
	_ = os.Remove(pubPath)
```

If `privPath` or `pubPath` exist but cannot be removed (e.g., due to file locks or restrictive permissions), `os.Remove` fails silently. The function then generates new temporary keys and attempts to rename them. On Windows, `os.Rename` fails if the destination already exists, resulting in a confusing "failed to rename private key" error.

Fix: Check the error returned by `os.Remove(privPath)` and `os.Remove(pubPath)`. If the error is not `os.ErrNotExist`, return it immediately to provide a clear diagnostic message ("failed to remove locked key") and avoid unnecessary key generation.

### Design Observations

#### Concerns
**Over-reliance on filesystem state for process management (in-scope)**
The lifecycle controller heavily relies on PID files read from `store.Inspect` to manage processes it spawned. Because the controller is the parent of the hostagent, it holds a direct `os.Process` handle in `instanceStates`. By passing this handle down to cleanup routines instead of relying on the filesystem, the controller could completely eliminate PID recycling risks for the hostagent. Windows Job Objects would extend this reliable tracking to all descendants (driver, ssh).

#### Strengths
- **WSL2 deadlock recovery:** Bounding `wsl.exe --unregister` and `wsl.exe --terminate` with context timeouts prevents stalled WSL subsystems from deadlocking the entire RDD control plane.
- **Robust glob alternative:** Using `os.ReadDir` with a suffix check instead of `filepath.Glob` in `stopInstanceForcibly` elegantly avoids Go's silent failure on Windows paths containing brackets (e.g., `C:\Users\name[1]`).

### Testing Assessment

1. **Stale PID file handling on Windows:** The BATS tests cannot easily simulate the Windows-specific PID recycling scenario (acknowledged in `handleDeletion`), leaving the `taskkill` collateral damage risk unverified.
2. **Graceful shutdown fallback:** The orphaned-hostagent test verifies recovery but does not isolate whether the hostagent shut down gracefully or hit the `gracefulShutdownTimeout` and was force-killed.

### Documentation Assessment

The comments clearly explain the intent and limitations of the platform-specific process management changes, specifically the discrepancies between Unix process groups and Windows `taskkill`.

### Commit Structure

The commits are logically separated and well-described. The progression from BATS MSYS2 adoption to process isolation and SSH key pre-generation is easy to follow.

### Acknowledged Limitations

- **Cross-console interrupt failure on Windows:** "On Windows, Interrupt uses GenerateConsoleCtrlEvent, which fails when caller and target lack a shared console; Kill (TerminateProcess) then bypasses graceful shutdown." (`pkg/service/cmd/service.go:344-348`). This change does not make the limitation worse; the previous code bypassed graceful shutdown entirely as well.

### Unresolved Feedback

None.

### Coverage Summary

- `.github/actions/spelling/expect/tech.txt` — Trivial
- `.github/actions/spelling/expect/tools.txt` — Trivial
- `.github/workflows/bats.yaml` — Reviewed, no issues
- `bats/helpers/load.bash` — Trivial
- `bats/helpers/os.bash` — Reviewed, no issues
- `bats/helpers/utils.bash` — Reviewed, no issues
- `bats/helpers/vm_template.bash` — Reviewed, no issues
- `bats/tests/33-lima-controllers/limavm-instance.bats` — Reviewed, no issues
- `bats/tests/33-lima-controllers/limavm-running.bats` — Reviewed, no issues
- `pkg/controllers/lima/limavm/controllers/hostagent_watcher.go` — Reviewed, no issues
- `pkg/controllers/lima/limavm/controllers/limavm_controller.go` — Reviewed, no issues
- `pkg/controllers/lima/limavm/controllers/limavm_lifecycle.go` — Finding 1
- `pkg/controllers/lima/limavm/controllers/ssh_keys.go` — Finding 3
- `pkg/controllers/lima/limavm/controllers/ssh_keys_test.go` — Reviewed, no issues
- `pkg/controllers/lima/limavm/controllers/ssh_keys_unix.go` — Reviewed, no issues
- `pkg/controllers/lima/limavm/controllers/ssh_keys_windows.go` — Reviewed, no issues
- `pkg/external/main.go` — Reviewed, no issues
- `pkg/service/cmd/service.go` — Reviewed, no issues
- `pkg/util/process/process_unix.go` — Reviewed, no issues
- `pkg/util/process/process_windows.go` — Finding 2
- `scripts/collect-bats-logs.sh` — Trivial
