# Deep Review: 20260323-161334-pr-242

| | |
|---|---|
| **Date** | 2026-03-23 16:13 |
| **Round** | 8 |
| **Author** | [@jandubois](https://github.com/jandubois) |
| **PR** | [#242](https://github.com/rancher-sandbox/rancher-desktop-daemon/pull/242) — Run Lima BATS tests on Windows via MSYS2 |
| **Branch** | `msys2-ci` |
| **Commits** | `8c3ed7b` Fix Windows process signaling for Lima hostagent lifecycle<br>`16c73c0` Fix stopInstanceForcibly to terminate WSL2 distro and clean up files<br>`3ab20a2` Enable Lima BATS tests on Windows<br>`e3685b8` Pre-generate Lima SSH keys on Windows to avoid broken path conversion<br>`1f174c6` Run Windows BATS tests via MSYS2 instead of WSL2<br>`aea03ce` Honor RDD_LOG_LEVEL in external controllers<br>`8b7e51c` Address PR review feedback<br>`ae2aa0e` Address deep-review findings for PR #242 |
| **Reviewers** | Claude Opus 4.6, Codex GPT 5.4, Gemini 3.1 Pro |
| **Verdict** | **Merge with fixes** — PID clearing before deletion is overly broad on Unix; one stale comment |
| **Wall-clock time** | `22 min 20 s` |

---

## Executive Summary

This PR migrates Windows BATS tests from WSL2 to MSYS2, eliminating flaky CI failures caused by WSL2 `binfmt_misc` interop breakage on GitHub Actions runners. It also reworks Windows process lifecycle management: hostagents now run in their own process group (`CREATE_NEW_PROCESS_GROUP`), graceful shutdown uses targeted `CTRL_BREAK_EVENT` instead of console-wide signals, and `stopInstanceForcibly` replaces Lima's `StopForcibly` to avoid killing the RDD service. SSH keys are pre-generated to work around Lima's MSYS2 path conversion dependency. The code is well-structured and thoroughly commented; the main finding is that PID clearing before `limainstance.Delete` protects against Windows PID recycling but unnecessarily disables force-stop for broken instances on Unix, where PID recycling is rare.

---

## Critical Issues

None.

---

## Important Issues

1. **PID clearing before deletion skips force-stop for broken instances on Unix** — `limavm_lifecycle.go:47-63` [Codex GPT 5.4, Gemini 3.1 Pro] (important, regression)

```go
if existingInst.Status == limatype.StatusRunning {
    stopInstanceForcibly(ctx, logger, existingInst)
} else if existingInst.VMType == limatype.WSL2 {
    terminateWSL2Distro(ctx, logger, existingInst.Name)
}
// Clear PIDs so Lima's Delete → StopForcibly does not kill
// unrelated processes if the PIDs were recycled (likely on Windows).
existingInst.DriverPID = 0
existingInst.HostAgentPID = 0
```

Only `StatusRunning` instances go through `stopInstanceForcibly` (line 52). For `StatusBroken` instances — where the hostagent socket probe fails but the process or its VM driver may still be alive — the code skips force-stop and then unconditionally clears both PID fields (lines 62–63) before calling `limainstance.Delete` (line 68). Lima's `Delete` internally calls `StopForcibly`, which checks these PIDs; with both zeroed, it skips process termination entirely.

The PID clearing addresses a real Windows concern: Lima's `ReadPIDFile` treats any live PID as valid, and Windows recycles PIDs aggressively. On Unix, however, PID recycling is rare (PIDs wrap around 32768+ values), and a broken instance's driver (e.g., QEMU) can genuinely be alive. Clearing PIDs unconditionally on all platforms trades a low-probability Windows risk for a resource leak on Unix.

Gemini rated this critical; Codex rated it important. Claude Opus called the PID clearing a strength. The actual severity is important: `StatusBroken` with live driver processes is an edge case (broken typically means the hostagent died, taking its children with it), but when it does occur, the leaked driver consumes significant resources until manual intervention.

Fix: restrict PID clearing to Windows, where PID recycling justifies it.

```diff
+	if runtime.GOOS == "windows" {
 		// Clear PIDs so Lima's Delete → StopForcibly does not kill
 		// unrelated processes if the PIDs were recycled (likely on Windows).
 		existingInst.DriverPID = 0
 		existingInst.HostAgentPID = 0
+	}
```

---

## Suggestions

1. **Stale comment references removed function** — `hostagent_watcher.go:78` [Claude Opus 4.6] (suggestion, gap)

```go
// every code path that cancels the watcher also kills the hostagent process
// (stopInstance, shutdownAllHostagents, StopForcibly on delete), so this
```

This PR replaces all calls to `limainstance.StopForcibly` with `stopInstanceForcibly`. The comment at line 78 still references `StopForcibly on delete`, which no longer exists in the call graph.

Fix:
```diff
 // every code path that cancels the watcher also kills the hostagent process
-// (stopInstance, shutdownAllHostagents, StopForcibly on delete), so this
+// (stopInstance, shutdownAllHostagents, stopInstanceForcibly on delete), so this
```

2. **Unbounded context for KillTree in shutdown path** — `limavm_controller.go:515` [Claude Opus 4.6] (suggestion, regression)

```go
_ = process.KillTree(context.Background(), state.cmd.Process.Pid)
```

`killHostagent` in `hostagent_watcher.go:228` uses `context.WithTimeout(context.Background(), 5*time.Second)` for the same `KillTree` call. The `shutdownAllHostagents` path at line 515 uses `context.Background()` without a timeout. On Windows, `KillTree` spawns `taskkill.exe` via `exec.CommandContext`; an unbounded context means a hung `taskkill` would block shutdown indefinitely. In practice `taskkill` completes instantly, but the inconsistency with the project's own pattern is worth fixing.

Fix:
```diff
+			killCtx, killCancel := context.WithTimeout(context.Background(), 5*time.Second)
-			_ = process.KillTree(context.Background(), state.cmd.Process.Pid)
+			_ = process.KillTree(killCtx, state.cmd.Process.Pid)
+			killCancel()
```

3. **Dead code in shutdownHostagent else-branch** — `limavm_lifecycle.go:535` [Claude Opus 4.6] (suggestion, regression)

```go
} else {
    logger.Info("No watcher for hostagent, forcing stop via stored PIDs")
    r.killHostagent(name) // no-op if watcher absent; forceStop uses stored PIDs
    forceStop()
    waitAfterKill()
}
```

The else-branch executes only when `signalHostagent` returned false (no watcher). `killHostagent` has the identical guard and will also be a no-op. The inline comment acknowledges this. Either remove the no-op call or remove the comment explaining it — the current state is harmless but adds noise.

---

## Design Observations

**Concerns**

1. **PID clearing scope is a cross-platform tradeoff, not a universal fix** — (in-scope) [Codex GPT 5.4, Gemini 3.1 Pro]

   The unconditional PID zeroing in `handleDeletion` treats all platforms the same, but the underlying concern (aggressive PID recycling) is Windows-specific. On Unix, PIDs wrap around a much larger space and recycling within the lifecycle of a single deletion is extremely unlikely. A `runtime.GOOS == "windows"` guard would preserve the Windows safety while allowing Lima's `Delete` to clean up live processes on Unix. This aligns with the existing platform-specific branching (`terminateWSL2Distro` is already Windows-only).

**Strengths**

- **Process group isolation**: Replacing Lima's console-wide `GenerateConsoleCtrlEvent(CTRL_BREAK)` with per-hostagent `CREATE_NEW_PROCESS_GROUP` + targeted `CTRL_BREAK_EVENT` eliminates the root cause of the "stopping a VM kills the service" bug. [Claude Opus 4.6, Codex GPT 5.4, Gemini 3.1 Pro]
- **SSH key crash recovery**: `ensureSSHKeysAt` handles every on-disk state (both keys present, public missing, corrupt private, stale temps, orphaned public) with generate-to-temp-then-rename. The test suite covers all six states with real `ssh-keygen`, no mocks. [Claude Opus 4.6, Gemini 3.1 Pro]
- **WSL2 deadlock mitigation**: Running `wsl.exe --terminate` before `--unregister` plus the BATS `taskkill wslservice.exe` backstop prevents a single WSL deadlock from blocking the CI suite. [Claude Opus 4.6, Gemini 3.1 Pro]
- **`shutdownHostagent` consolidation**: Three separate copies of the signal-then-wait-then-kill pattern are extracted into a single method with correct background context handling. [Claude Opus 4.6]


---

## Testing Assessment

Process management assertions (`kill -0`, `kill -9`) are replaced with platform-aware helpers (`assert_process_alive`, `process_kill`). SSH key generation has 6 unit test cases covering all on-disk states with real `ssh-keygen`.

Untested scenarios ranked by risk:

1. **Deleting a `StatusBroken` instance with live driver PIDs on Unix** — the PID clearing means Lima's internal `StopForcibly` is bypassed, and no test verifies that orphaned drivers are cleaned up.
2. **`stopInstanceForcibly` force-stop path** — the code comment at `limavm_lifecycle.go:525-527` acknowledges this requires a hostagent that ignores shutdown signals, which is difficult to simulate.
3. **`Interrupt` → `Kill` fallback on Windows when caller and target lack a shared console** — documented at `service.go:346-351` but not exercised by tests.
4. **WSL2 deadlock recovery in production** — the `taskkill wslservice.exe` backstop exists only in BATS cleanup, not in the production code path.

---

## Documentation Assessment

Commit messages explain both what and why. Inline comments are thorough — the `stopInstanceForcibly` docstring, the `Interrupt` platform behavior comments in `service.go`, and the crash-recovery comments in `ensureSSHKeysAt` all explain non-obvious design decisions. The PR description should include the WSL2 flakiness rationale (as noted by mook-as in the review comments).

---

## Commit Structure

The 8 commits follow a logical progression: process signaling → forcible stop → WSL2 cleanup → SSH keys → BATS enablement → CI migration → log level propagation → review feedback. The two fixup commits (`8b7e51c`, `ae2aa0e`) address code review feedback and are appropriately separate for PR reviewability.

---

## Acknowledged Limitations

1. **`taskkill /T` cannot traverse dead parents** — `process_windows.go:70-75`:
   > "Platform asymmetry: if the target process is already dead, taskkill /T returns exit code 128 [...] surviving children (e.g., SSH port forwarders) are not killed [...] Windows Job Objects would fix this if needed."
   Currently acceptable; orphaned port forwarders cannot rebind their ports.

2. **Sequential hostagent shutdown** — `limavm_controller.go:504-505`:
   > "TODO: Wait on all hostagents in parallel instead of sequentially; with the current loop, the total wait is N × gracefulShutdownTimeout in the worst case."
   Pre-existing; this PR does not worsen it.

3. **Graceful shutdown bypass on Windows** — `service.go:346-351`:
   Starting the service in one console and stopping from another bypasses graceful interrupt, falling back to process termination. Handled by the daemon's orphan recovery mechanisms.

4. **Stale PID simulation untested** — `limavm_lifecycle.go:50-51`:
   > "Not tested: simulating stale PID files requires Windows-specific PID file manipulation that BATS cannot easily reproduce."
   This matters more after the PID-zeroing change, since the untested path now controls delete-time safety.

---

## Unresolved Feedback

All substantive PR review feedback has been addressed:
- Spelling words moved to `expect/*` files (commit `8b7e51c`)
- `--sync` long-form flag for pacman (commit `8b7e51c`)
- Job Objects suggestion acknowledged with existing TODO comment at `process_windows.go:75`
- SSH path dependency explained in [PR comment](https://github.com/rancher-sandbox/rancher-desktop-daemon/pull/242#discussion_r2977446774) (Lima dependency, plan to use `wsl.exe` for shell access)

---

## Agent Performance Retro

### [Claude] Opus 4.6

Claude produced the most thorough review (505s). It found three suggestions that no other agent raised: the stale `StopForcibly` comment, the unbounded `KillTree` context, and the dead `killHostagent` no-op. It traced call chains deeply (verifying `limainstance.StopForcibly` is no longer called anywhere, checking `killHostagent`'s guard against `signalHostagent`'s guard). However, it assessed the PID-clearing as a **strength** rather than a concern — the opposite conclusion from the other two agents. This reflects a design judgment call: Claude weighted the Windows PID recycling protection more heavily than the Unix resource leak risk. All 21 files reviewed.

### [Codex] GPT 5.4

Codex produced a focused review (347s, 73 lines). It identified the PID-clearing issue at IMPORTANT severity with a well-supported analysis tracing into Lima's `instance/delete.go` and `instance/stop.go`. Its fix suggestion (keep the workaround narrow — either force-stop broken instances or clear PIDs only after confirming they are stale) is sound. It missed the stale comment, the unbounded context, and the `setVerbosityFromEnv` issue (which Gemini found but was already fixed). All 21 files reviewed.

### [Gemini] 3.1 Pro

Gemini was the fastest agent (288s) and produced the sharpest analysis of the PID-clearing issue, though it over-rated the severity as CRITICAL. Its suggested fix (platform-gated PID clearing) is the most concrete. However, its second finding — missing `warn`/`error` mapping in `setVerbosityFromEnv` — is a **false positive**: commit `ae2aa0e` already added these levels to the case statement. Gemini appears to have read an earlier version of the diff. All 21 files reviewed.

### Summary

| Metric | Claude Opus 4.6 | Codex GPT 5.4 | Gemini 3.1 Pro |
|--------|-----------------|----------------|----------------|
| Duration | 8:25 | 5:47 | 4:48 |
| Critical | 0 | 0 | 1 (disputed → 0) |
| Important | 0 | 1 | 1 (false positive) |
| Suggestion | 3 | 0 | 0 |
| Design observations | 5 | 1 | 4 |
| False positives | 0 | 0 | 1 |
| Unique insights | 3 | 0 | 0 |
| Files reviewed | 21 | 21 | 21 |
| Coverage misses | 0 | 0 | 0 |

Claude provided the most unique value through its three suggestions that no other agent found. Codex and Gemini both identified the PID-clearing concern that Claude missed (or assessed differently). Gemini was fastest but had the only false positive. The multi-agent approach proved valuable: the PID-clearing issue and Claude's suggestions would have been missed by any single agent.

---

## Appendix: Original Reviews

### Claude Opus 4.6

Good — `limainstance.StopForcibly` is no longer called anywhere in the codebase (only mentioned in a comment). The reference in `hostagent_watcher.go:78` is the stale comment. Now let me write the complete review.

---

## Deep Code Review: PR #242 — Run Lima BATS tests on Windows via MSYS2

| Aspect | Assessment |
|--------|-----------|
| **Verdict** | Approve with minor suggestions |
| **Risk** | Low |
| **Complexity** | High — cross-platform process lifecycle, WSL2 deadlock prevention, CI migration |

### Executive Summary

This PR migrates Windows CI from WSL2-based BATS to MSYS2-based BATS, eliminating flaky failures caused by WSL2 `binfmt_misc` interop breakage. Alongside the CI migration, it implements correct Windows process isolation for hostagent management (`CREATE_NEW_PROCESS_GROUP` + `CTRL_BREAK_EVENT`), replaces Lima's `StopForcibly` (which used console-wide `GenerateConsoleCtrlEvent` that killed the RDD service) with targeted `KillTree`, and pre-generates SSH keys to work around Lima's broken MSYS2 path conversion. The process lifecycle changes are well-reasoned, handle edge cases carefully (PID recycling, WSL2 deadlocks, stale temp files), and are appropriately commented. No critical issues found.

### Findings

#### Critical Issues

None.

#### Important Issues

None.

#### Suggestions

1. **Stale comment references removed function** — `hostagent_watcher.go:78` (suggestion, gap)

```go
// every code path that cancels the watcher also kills the hostagent process
// (stopInstance, shutdownAllHostagents, StopForcibly on delete), so this
```

This PR replaces all calls to `limainstance.StopForcibly` with the new `stopInstanceForcibly` (confirmed: `limainstance.StopForcibly` is no longer called anywhere in the codebase). The comment at line 78 still references `StopForcibly on delete`, which now refers to a function that was removed from the call graph by this change.

Fix:
```diff
 // every code path that cancels the watcher also kills the hostagent process
-// (stopInstance, shutdownAllHostagents, StopForcibly on delete), so this
+// (stopInstance, shutdownAllHostagents, stopInstanceForcibly on delete), so this
```

2. **Unbounded context for `KillTree` in shutdown path** — `limavm_controller.go:515` (suggestion, regression)

```go
_ = process.KillTree(context.Background(), state.cmd.Process.Pid)
```

`killHostagent` in `hostagent_watcher.go:228` uses `context.WithTimeout(context.Background(), 5*time.Second)` for the same `KillTree` call. The `shutdownAllHostagents` path at line 515 uses `context.Background()` without a timeout. On Windows, `KillTree` spawns `taskkill.exe` via `exec.CommandContext`; an unbounded context means a hung `taskkill` would block shutdown indefinitely. In practice `taskkill` completes instantly, but this is inconsistent with the project's own pattern established 13 lines away in the same file.

Fix:
```diff
+			killCtx, killCancel := context.WithTimeout(context.Background(), 5*time.Second)
-			_ = process.KillTree(context.Background(), state.cmd.Process.Pid)
+			_ = process.KillTree(killCtx, state.cmd.Process.Pid)
+			killCancel()
```

3. **Dead code in `shutdownHostagent` else-branch** — `limavm_lifecycle.go:535` (suggestion, regression)

```go
} else {
    logger.Info("No watcher for hostagent, forcing stop via stored PIDs")
    r.killHostagent(name) // no-op if watcher absent; forceStop uses stored PIDs
    forceStop()
    waitAfterKill()
}
```

`killHostagent` at `hostagent_watcher.go:223` and `signalHostagent` at line 208 have identical guards (`!ok || state.cmd == nil || state.cmd.Process == nil`). The else-branch only executes when `signalHostagent` returns false, which guarantees `killHostagent` will also be a no-op. The inline comment acknowledges this ("no-op if watcher absent"), but the call adds noise. Since the comment already documents the no-op behavior, this is purely stylistic — either remove the call or remove the comment explaining it's a no-op.

### Design Observations

**Concerns**

1. **DriverPID may not be a process group leader on Unix** — (future)

   `stopInstanceForcibly` at `limavm_lifecycle.go:603` calls `KillTree` on both `DriverPID` and `HostAgentPID`. On Unix, `KillTree` uses `kill(-pid, SIGKILL)` which requires the PID to be a process group leader. When the driver (e.g., QEMU) is a child of the hostagent and shares its process group, `kill(-driverPID, SIGKILL)` returns `ESRCH` (mapped to nil) because `driverPID` isn't a group leader. This is harmless today — `KillTree` on the hostagent's group kills the driver as a member — but if a future VM driver runs in its own group, the iteration order (driver first, hostagent second) means the driver's group would be killed correctly. The current code is safe; the `ESRCH → nil` mapping at `process_unix.go:44-46` is the right defense.

**Strengths**

- **PID recycling defense**: Clearing `DriverPID` and `HostAgentPID` to 0 on the Go struct before passing to `limainstance.Delete` (`limavm_lifecycle.go:62-63`) prevents Lima's internal `StopForcibly` from killing recycled processes. This is a subtle Windows concern handled at exactly the right layer.
- **SSH key crash recovery**: `ensureSSHKeysAt` handles every on-disk state combination (both exist, public missing, corrupt private, stale temps, orphaned public) with correct cleanup ordering. The generate-to-temp-then-rename pattern avoids leaving partial keys. The test suite covers all six states with real `ssh-keygen` — no mocks.
- **WSL2 deadlock mitigation**: Running `wsl.exe --terminate` before `--unregister` (`handleDeletion` at lines 54-58) matches the known deadlock avoidance pattern. The BATS cleanup at `limavm-running.bats:719-724` adds a `taskkill wslservice.exe` recovery backstop, preventing a single WSL deadlock from blocking the entire CI suite.
- **`shutdownHostagent` consolidation**: The previous code had three separate copies of the signal-then-wait-then-kill pattern (in `stopInstance`, `handleRestartNeeded`, and `killOrphanedHostagent`). This PR extracts a single `shutdownHostagent` method with correct background context handling for the force-stop path.

### Testing Assessment

The existing BATS integration tests now run cross-platform (Unix and Windows/MSYS2). Process management assertions (`kill -0`, `kill -9`) are replaced with platform-aware helpers (`assert_process_alive`, `process_kill`). The SSH key generation has 6 unit test cases covering all on-disk states.

Specific untested scenarios (ranked by risk):

1. **`stopInstanceForcibly` force-stop path**: The code comment at `limavm_lifecycle.go:525-527` acknowledges this requires a hostagent that ignores shutdown signals, which is difficult to simulate in integration tests.
2. **Stale PID files pointing to recycled processes on Windows**: Comment at `limavm_lifecycle.go:50-51` notes this requires Windows-specific PID file manipulation that BATS cannot easily reproduce.
3. **`GenerateConsoleCtrlEvent` failure when caller and target lack a shared console**: The `StopWithWait` comment at `service.go:346-351` documents this Windows behavior, but no test exercises the `Interrupt` → `Kill` fallback path.
4. **WSL2 deadlock recovery in production**: The `taskkill wslservice.exe` backstop is only in the BATS cleanup, not in the production code path. In production, the 60-second `deleteCtx` timeout at `limavm_lifecycle.go:67` is the only defense.

### Documentation Assessment

The commit messages are thorough and explain both what and why. The inline code comments are excellent — the `stopInstanceForcibly` docstring at lines 590-600, the `Interrupt` platform behavior comments at `service.go:341-351`, and the crash-recovery comments in `ensureSSHKeysAt` all explain non-obvious design decisions. The PR description should include the WSL2 flakiness rationale (as noted by mook-as in the review comments).

### Commit Structure

The 7 PR commits follow a logical progression: process signaling → forcible stop → WSL2 cleanup → SSH keys → BATS enablement → CI migration → review feedback. The two fixup commits (`8b7e51c Address PR review feedback`, `ae2aa0e Address deep-review findings`) address code review feedback and are appropriately separate for PR reviewability.

### Acknowledged Limitations

1. **`taskkill /T` cannot traverse dead parents** — `process_windows.go:70-75`:
   > "Platform asymmetry: if the target process is already dead, taskkill /T returns exit code 128 [...] surviving children (e.g., SSH port forwarders) are not killed [...] Windows Job Objects would fix this if needed."
   This limitation is more likely to matter if hostagents start additional long-lived subprocesses beyond SSH forwarders. Currently acceptable.

2. **Sequential hostagent shutdown** — `limavm_controller.go:504-505`:
   > "TODO: Wait on all hostagents in parallel instead of sequentially; with the current loop, the total wait is N × gracefulShutdownTimeout in the worst case."
   This PR does not change this pre-existing limitation, though the new `stopInstanceForcibly` cleanup in the `!graceful` block at lines 519-528 adds more work per iteration. With the current single-VM usage pattern, the sequential wait is not a practical concern.

### Unresolved Feedback

All substantive PR review feedback has been addressed:
- Spelling words moved to `expect/*` files (commit `8b7e51c`)
- `--sync` long-form flag for pacman (visible in current diff at line 113)
- Job Objects suggestion acknowledged with existing TODO comment at `process_windows.go:75`
- SSH path dependency explained in PR comment (Lima dependency, will use `wsl.exe` for shell access)

### Coverage Summary

| File | Status |
|------|--------|
| `.github/actions/spelling/expect/tech.txt` | Trivial |
| `.github/actions/spelling/expect/tools.txt` | Trivial |
| `.github/workflows/bats.yaml` | Reviewed, no issues |
| `bats/helpers/load.bash` | Reviewed, no issues |
| `bats/helpers/os.bash` | Reviewed, no issues |
| `bats/helpers/utils.bash` | Reviewed, no issues |
| `bats/helpers/vm_template.bash` | Reviewed, no issues |
| `bats/tests/33-lima-controllers/limavm-instance.bats` | Reviewed, no issues |
| `bats/tests/33-lima-controllers/limavm-running.bats` | Reviewed, no issues |
| `pkg/controllers/lima/limavm/controllers/hostagent_watcher.go` | Finding 1 |
| `pkg/controllers/lima/limavm/controllers/limavm_controller.go` | Finding 2 |
| `pkg/controllers/lima/limavm/controllers/limavm_lifecycle.go` | Finding 3 |
| `pkg/controllers/lima/limavm/controllers/ssh_keys.go` | Reviewed, no issues |
| `pkg/controllers/lima/limavm/controllers/ssh_keys_test.go` | Reviewed, no issues |
| `pkg/controllers/lima/limavm/controllers/ssh_keys_unix.go` | Reviewed, no issues |
| `pkg/controllers/lima/limavm/controllers/ssh_keys_windows.go` | Reviewed, no issues |
| `pkg/external/main.go` | Reviewed, no issues |
| `pkg/service/cmd/service.go` | Reviewed, no issues |
| `pkg/util/process/process_unix.go` | Reviewed, no issues |
| `pkg/util/process/process_windows.go` | Reviewed, no issues |
| `scripts/collect-bats-logs.sh` | Reviewed, no issues |

---

### Codex GPT 5.4

**Verdict**: Changes requested

### Executive Summary
This PR moves Windows BATS execution to MSYS2 and reworks the Lima hostagent/service shutdown path so Windows no longer relies on WSL interop for the test runner. The overall direction is good, but the new deletion path regresses cleanup for `Broken` Lima instances: it now clears Lima’s stored PIDs before calling `limainstance.Delete()`, which means live orphaned hostagent/driver processes are no longer killed during delete.

### Findings

**Critical Issues**
None.

**Important Issues**

1. **Deleting a broken instance can orphan the live hostagent/driver** — [pkg/controllers/lima/limavm/controllers/limavm_lifecycle.go#L47](/Users/jan/git/rancher-desktop-daemon/pkg/controllers/lima/limavm/controllers/limavm_lifecycle.go#L47) (important, regression)

```go
if existingInst.Status == limatype.StatusRunning {
    stopInstanceForcibly(ctx, logger, existingInst)
} else if existingInst.VMType == limatype.WSL2 {
    terminateWSL2Distro(ctx, logger, existingInst.Name)
}
existingInst.DriverPID = 0
existingInst.HostAgentPID = 0
err = limainstance.Delete(deleteCtx, existingInst, true)
```

At [limavm_lifecycle.go:52-63](/Users/jan/git/rancher-desktop-daemon/pkg/controllers/lima/limavm/controllers/limavm_lifecycle.go#L52) only `StatusRunning` instances go through `stopInstanceForcibly()`, but [limavm_lifecycle.go:62-63](/Users/jan/git/rancher-desktop-daemon/pkg/controllers/lima/limavm/controllers/limavm_lifecycle.go#L62) then zero both PID fields unconditionally before [limavm_lifecycle.go:68](/Users/jan/git/rancher-desktop-daemon/pkg/controllers/lima/limavm/controllers/limavm_lifecycle.go#L68) calls Lima’s `Delete()`. That is unsafe because Lima can legitimately report `StatusBroken` while `HostAgentPID` and/or `DriverPID` are still live, for example when the hostagent socket probe fails or only one side of the pair is still running (`store/instance.go:66-80` and `:163-174` in the Lima dependency). Lima’s `Delete()` immediately calls `StopForcibly()`, and `StopForcibly()` only kills the PIDs present on the struct (`instance/delete.go:16-25`, `instance/stop.go:109-138`). After this change, those live `Broken`-state processes are skipped and the instance directory is removed underneath them.

Blame shows the problematic logic was introduced in `16c73c06` / `ae2aa0ec`.

Fix: keep the stale-PID workaround narrow. Either force-stop `Broken` instances before zeroing, or pass a copy into `limainstance.Delete()` and clear PID fields only after you have positively established that they are stale. Do not unconditionally blank the original `existingInst` for every non-running state.

**Suggestions**
None.

### Design Observations

**Strengths**
The move to explicit process-group management in [pkg/util/process/process_windows.go](/Users/jan/git/rancher-desktop-daemon/pkg/util/process/process_windows.go) and the controller-side shutdown sequencing is a good simplification. It removes dependence on console-wide CTRL_BREAK behavior and makes the Windows semantics much easier to reason about.

### Testing Assessment
1. The highest-risk missing scenario is deleting a `StatusBroken` Lima instance that still has a live hostagent or driver PID on Windows; the current BATS coverage exercises running/stopped cleanup, but not this broken-state delete path.
2. The Windows fallback path where graceful signaling cannot be delivered because the caller and target do not share a console is still only indirectly covered.
3. The SSH key pre-generation tests are useful, but they remain OS-agnostic; they do not exercise the Windows-specific executable/path environment that motivated the workaround.

### Documentation Assessment
No blocking documentation issues in the code. The new inline comments around Windows process handling are generally strong.

### Acknowledged Limitations
- “`Not tested: simulating stale PID files requires Windows-specific PID file manipulation...`” — [pkg/controllers/lima/limavm/controllers/limavm_lifecycle.go#L50](/Users/jan/git/rancher-desktop-daemon/pkg/controllers/lima/limavm/controllers/limavm_lifecycle.go#L50). This limitation matters more after the PID-zeroing change, because the untested path now controls delete-time safety.
- “`TODO: Wait on all hostagents in parallel instead of sequentially...`” — [pkg/controllers/lima/limavm/controllers/limavm_controller.go#L504](/Users/jan/git/rancher-desktop-daemon/pkg/controllers/lima/limavm/controllers/limavm_controller.go#L504). Still valid; this PR does not materially worsen it.

### Coverage Summary
- [.github/actions/spelling/expect/tech.txt](/Users/jan/git/rancher-desktop-daemon/.github/actions/spelling/expect/tech.txt) — Trivial
- [.github/actions/spelling/expect/tools.txt](/Users/jan/git/rancher-desktop-daemon/.github/actions/spelling/expect/tools.txt) — Trivial
- [.github/workflows/bats.yaml](/Users/jan/git/rancher-desktop-daemon/.github/workflows/bats.yaml) — Reviewed, no issues
- [bats/helpers/load.bash](/Users/jan/git/rancher-desktop-daemon/bats/helpers/load.bash) — Reviewed, no issues
- [bats/helpers/os.bash](/Users/jan/git/rancher-desktop-daemon/bats/helpers/os.bash) — Reviewed, no issues
- [bats/helpers/utils.bash](/Users/jan/git/rancher-desktop-daemon/bats/helpers/utils.bash) — Reviewed, no issues
- [bats/helpers/vm_template.bash](/Users/jan/git/rancher-desktop-daemon/bats/helpers/vm_template.bash) — Reviewed, no issues
- [bats/tests/33-lima-controllers/limavm-instance.bats](/Users/jan/git/rancher-desktop-daemon/bats/tests/33-lima-controllers/limavm-instance.bats) — Reviewed, no issues
- [bats/tests/33-lima-controllers/limavm-running.bats](/Users/jan/git/rancher-desktop-daemon/bats/tests/33-lima-controllers/limavm-running.bats) — Reviewed, no issues
- [pkg/controllers/lima/limavm/controllers/hostagent_watcher.go](/Users/jan/git/rancher-desktop-daemon/pkg/controllers/lima/limavm/controllers/hostagent_watcher.go) — Reviewed, no issues
- [pkg/controllers/lima/limavm/controllers/limavm_controller.go](/Users/jan/git/rancher-desktop-daemon/pkg/controllers/lima/limavm/controllers/limavm_controller.go) — Reviewed, no issues
- [pkg/controllers/lima/limavm/controllers/limavm_lifecycle.go](/Users/jan/git/rancher-desktop-daemon/pkg/controllers/lima/limavm/controllers/limavm_lifecycle.go) — Finding 1
- [pkg/controllers/lima/limavm/controllers/ssh_keys.go](/Users/jan/git/rancher-desktop-daemon/pkg/controllers/lima/limavm/controllers/ssh_keys.go) — Reviewed, no issues
- [pkg/controllers/lima/limavm/controllers/ssh_keys_test.go](/Users/jan/git/rancher-desktop-daemon/pkg/controllers/lima/limavm/controllers/ssh_keys_test.go) — Reviewed, no issues
- [pkg/controllers/lima/limavm/controllers/ssh_keys_unix.go](/Users/jan/git/rancher-desktop-daemon/pkg/controllers/lima/limavm/controllers/ssh_keys_unix.go) — Reviewed, no issues
- [pkg/controllers/lima/limavm/controllers/ssh_keys_windows.go](/Users/jan/git/rancher-desktop-daemon/pkg/controllers/lima/limavm/controllers/ssh_keys_windows.go) — Reviewed, no issues
- [pkg/external/main.go](/Users/jan/git/rancher-desktop-daemon/pkg/external/main.go) — Reviewed, no issues
- [pkg/service/cmd/service.go](/Users/jan/git/rancher-desktop-daemon/pkg/service/cmd/service.go) — Reviewed, no issues
- [pkg/util/process/process_unix.go](/Users/jan/git/rancher-desktop-daemon/pkg/util/process/process_unix.go) — Reviewed, no issues
- [pkg/util/process/process_windows.go](/Users/jan/git/rancher-desktop-daemon/pkg/util/process/process_windows.go) — Reviewed, no issues
- [scripts/collect-bats-logs.sh](/Users/jan/git/rancher-desktop-daemon/scripts/collect-bats-logs.sh) — Reviewed, no issues

---

### Gemini 3.1 Pro

### Executive Summary

This PR migrates the Windows integration test runner from WSL2 to MSYS2 to eliminate flaky GitHub Actions failures caused by WSL interop bugs. It also vastly improves process lifecycle management on Windows by utilizing `CREATE_NEW_PROCESS_GROUP` and tree-based killing, which prevents stopping a VM from inadvertently terminating the RDD service. While the changes are highly effective, the Windows-specific protection against recycling PIDs inadvertently introduces a severe resource leak on Unix platforms when broken instances are deleted.

### Findings

#### Critical Issues

1. **Clearing PIDs before deletion orphans VM drivers for `StatusBroken` instances on Unix** — `pkg/controllers/lima/limavm/controllers/limavm_lifecycle.go:44-67`

```go
		if existingInst.Status == limatype.StatusRunning {
			stopInstanceForcibly(ctx, logger, existingInst)
		} else if existingInst.VMType == limatype.WSL2 {
			terminateWSL2Distro(ctx, logger, existingInst.Name)
		}
		// Clear PIDs so Lima's Delete → StopForcibly does not kill
		// unrelated processes if the PIDs were recycled (likely on Windows).
		existingInst.DriverPID = 0
		existingInst.HostAgentPID = 0
		logger.Info("Deleting Lima instance", "instance", limaVM.Name)
```

By intentionally bypassing `stopInstanceForcibly` for `StatusBroken` instances to avoid killing recycled PIDs on Windows, and then clearing the PIDs right before calling Lima's `limainstance.Delete`, the code leaves the underlying VM driver (e.g., QEMU or VZ on macOS/Linux) running indefinitely. Previously, Lima's `Delete` method would fall back to its own internal `StopForcibly` call using the populated PIDs, terminating the driver. Now that the PIDs are cleared, the driver is orphaned, resulting in a severe resource leak (gigabytes of RAM and CPU) every time a broken instance is deleted.

**Fix:** Apply the `StatusBroken` bypass and PID clearing exclusively to Windows, where PID recycling is aggressive enough to warrant it. On Unix, PID recycling is slower, and terminating the driver is both safe and necessary.

```diff
-		if existingInst.Status == limatype.StatusRunning {
+		if existingInst.Status == limatype.StatusRunning || (runtime.GOOS != "windows" && existingInst.Status == limatype.StatusBroken) {
 			stopInstanceForcibly(ctx, logger, existingInst)
 		} else if existingInst.VMType == limatype.WSL2 {
 			// A "stopped" WSL2 distro can retain kernel state that deadlocks
 			// wsl.exe --unregister. Terminate it without PID-based killing,
 			// since the PIDs may have been recycled on Windows.
 			terminateWSL2Distro(ctx, logger, existingInst.Name)
 		}
-		// Clear PIDs so Lima's Delete → StopForcibly does not kill
-		// unrelated processes if the PIDs were recycled (likely on Windows).
-		existingInst.DriverPID = 0
-		existingInst.HostAgentPID = 0
+		if runtime.GOOS == "windows" {
+			// Clear PIDs so Lima's Delete → StopForcibly does not kill
+			// unrelated processes if the PIDs were recycled (likely on Windows).
+			existingInst.DriverPID = 0
+			existingInst.HostAgentPID = 0
+		}
 		logger.Info("Deleting Lima instance", "instance", limaVM.Name)
```

*(critical, regression)*

#### Important Issues

2. **`setVerbosityFromEnv` regression drops `warn`/`error` mapping** — `pkg/external/main.go:275-279`

```go
	case "", "info":
		// info is the default klog verbosity; nothing to change.
	default:
		// klog has no warn/error levels; unsupported values fall through to
		// the default verbosity (v=0), which is equivalent to info-only.
		fmt.Fprintf(os.Stderr, "RDD_LOG_LEVEL=%q is not supported for klog; valid values are trace, debug, info\n", level)
```

The `rdd` CLI properly supports logging levels like `warn`, `error`, and `fatal` because it is backed by Logrus. However, `setVerbosityFromEnv` maps `RDD_LOG_LEVEL` for external controllers using `klog`. By removing the explicit fallthrough cases for `warn`, `error`, etc. (which were present in an earlier iteration of the PR), any user legitimately running `rdd svc start --log-level warn` will now trigger misleading warning messages printed to `os.Stderr` by every spawned external controller on startup.

**Fix:** Restore the missing log levels to the empty case to silently map them to `info`.

```diff
-	case "", "info":
+	case "", "info", "warn", "warning", "error", "fatal":
 		// info is the default klog verbosity; nothing to change.
```

*(important, regression)*

#### Suggestions

None.

### Design Observations

#### Concerns

None.

#### Strengths

- **Graceful degradation of context termination:** The implementation inside `shutdownAllHostagents` expertly handles the `context.Cancelled` scenario. The RDD daemon enforces a hard 45-second cap on manager teardown, guaranteeing that a stuck `process.KillTree` (e.g. frozen by an antivirus filter or driver lock) does not leave the service permanently hung.
- **Resilient file I/O against partial state:** The `ensureSSHKeysAt` logic accounts for almost every realistic failure scenario, including half-written temp files, orphaned public keys without a matching private key, and corrupted data.
- **Judicious MSYS2 / WSL interoperability:** Running MSYS2 strictly for BATS orchestration while interacting natively with the Windows OS is a massive stability improvement over fighting `binfmt_misc` inside WSL.

### Testing Assessment

The test coverage is solid. The BATS harness efficiently checks system states, process recoveries, and file synchronization differences across Windows vs. Unix variants.

**Untested Scenarios:**
1. Verifying that a `StatusBroken` hostagent correctly triggers a driver termination on Unix when deleted (associated with Finding 1).

### Documentation Assessment

Internal documentation and code comments are thorough and clear. Specifically, the rationale surrounding platform asymmetries (Unix signals vs. Windows Console Events) correctly prepares future maintainers for these quirks.

### Commit Structure

The commit history clearly isolates conceptual changes: process grouping signaling, WSL2 file locks, BATS testing adjustments, and external log levels. This PR scope is cohesive despite spanning multiple domains because the shared goal revolves around isolating processes and securing tests across operating systems. 

### Acknowledged Limitations

- **`taskkill` cannot traverse tree from a dead parent:** `pkg/util/process/process_windows.go:78` notes that if a process is already dead, taskkill exit code 128 is treated as a success, though remaining port-forwarders cannot be killed. As noted, this is an acceptable tradeoff since orphaned port forwarders are harmless.
- **Sequential hostagent waits:** `pkg/controllers/lima/limavm/controllers/limavm_controller.go:504` acknowledges that waiting on `N` hostagents occurs sequentially, potentially reaching `N * timeout`.
- **Graceful shutdown bypass on Windows:** `pkg/service/cmd/service.go:348` outlines that starting the service in one console and stopping it from another bypasses graceful interrupt, falling back directly to process termination. This is a fundamental Windows API limitation well handled by the daemon's orphan recovery mechanisms.
- **Simulating stale PID files for testing:** `pkg/controllers/lima/limavm/controllers/limavm_lifecycle.go:49` mentions that creating specific test conditions for stale PID files on Windows would require overly-complex OS manipulation that the BATS tests cannot easily simulate.

### Coverage Summary

- `.github/actions/spelling/expect/tech.txt` — Trivial
- `.github/actions/spelling/expect/tools.txt` — Trivial
- `.github/workflows/bats.yaml` — Reviewed, no issues
- `bats/helpers/load.bash` — Trivial
- `bats/helpers/os.bash` — Reviewed, no issues
- `bats/helpers/utils.bash` — Reviewed, no issues
- `bats/helpers/vm_template.bash` — Reviewed, no issues
- `bats/tests/33-lima-controllers/limavm-instance.bats` — Reviewed, no issues
- `bats/tests/33-lima-controllers/limavm-running.bats` — Reviewed, no issues
- `pkg/controllers/lima/limavm/controllers/hostagent_watcher.go` — Reviewed, no issues
- `pkg/controllers/lima/limavm/controllers/limavm_controller.go` — Reviewed, no issues
- `pkg/controllers/lima/limavm/controllers/limavm_lifecycle.go` — **Finding 1**
- `pkg/controllers/lima/limavm/controllers/ssh_keys.go` — Reviewed, no issues
- `pkg/controllers/lima/limavm/controllers/ssh_keys_test.go` — Reviewed, no issues
- `pkg/controllers/lima/limavm/controllers/ssh_keys_unix.go` — Reviewed, no issues
- `pkg/controllers/lima/limavm/controllers/ssh_keys_windows.go` — Reviewed, no issues
- `pkg/external/main.go` — **Finding 2**
- `pkg/service/cmd/service.go` — Reviewed, no issues
- `pkg/util/process/process_unix.go` — Reviewed, no issues
- `pkg/util/process/process_windows.go` — Reviewed, no issues
- `scripts/collect-bats-logs.sh` — Reviewed, no issues
