# Deep Review: 20260324-181620-pr-242

| | |
|---|---|
| **Date** | 2026-03-24 18:16 |
| **Repo** | [rancher-sandbox/rancher-desktop-daemon](https://github.com/rancher-sandbox/rancher-desktop-daemon) |
| **Round** | 12 (of PR #242) |
| **Author** | [@jandubois](https://github.com/jandubois) |
| **PR** | [#242](https://github.com/rancher-sandbox/rancher-desktop-daemon/pull/242) — Run Lima BATS tests on Windows via MSYS2 |
| **Branch** | `msys2-ci` |
| **Commits** | `37092d9` Split app controller Lima template by platform<br>`044a161` Enable App controller BATS tests on Windows<br>`a4263f9` Address review #11 findings for PR #242<br>`efa39ad` Address review #10 findings for PR #242<br>`f6c4efc` Address review #9 findings for PR #242<br>`271965a` Address review #8 findings for PR #242<br>`69fa6fd` Address deep-review findings for PR #242<br>`8dae15f` Address PR review feedback<br>`acfe13e` Honor RDD_LOG_LEVEL in external controllers<br>`5d0d146` Run Windows BATS tests via MSYS2 instead of WSL2<br>`d28899f` Pre-generate Lima SSH keys on Windows to avoid broken path conversion<br>`62c8004` Enable Lima BATS tests on Windows<br>`dcd9c75` Fix stopInstanceForcibly to terminate WSL2 distro and clean up files<br>`bccfc5d` Fix Windows process signaling for Lima hostagent lifecycle |
| **Reviewers** | Claude Opus 4.6, Codex GPT 5.4, Gemini 3.1 Pro |
| **Verdict** | **Merge with fixes** — Two observability and safety gaps in Windows process shutdown paths |
| **Wall-clock time** | `23 min 53 s` |

---

## Executive Summary

This PR moves Windows BATS integration tests from WSL2 to MSYS2 (eliminating flaky `binfmt_misc` interop failures), implements Windows-specific process isolation so hostagent management no longer kills the RDD service, fixes `stopInstanceForcibly` to handle WSL2 distro termination and stale PID/socket cleanup, and adds SSH key pre-generation to avoid MSYS2 path conversion issues. The changes are well-structured, handling platform asymmetries carefully with thorough documentation. Two important issues need attention: the `StopWithWait` timeout path can kill a recycled PID on Windows (regression), and `shutdownAllHostagents` silently discards `store.Inspect` errors (regression).

---

## Critical Issues

None.

---

## Important Issues

1. **`StopWithWait` timeout can kill a recycled PID on Windows** — `pkg/service/cmd/service.go:340-375` [Codex GPT 5.4]

```go
pid := PID()                              // line 340: captured once
// ... 60s timeout loop ...
case <-timeout:
    _ = process.Kill(pid)                  // line 375: kills captured PID
```

`StopWithWait()` captures `pid := PID()` at line 340, then spins for up to 60 seconds polling `Running()`. The `process.Kill(pid)` at line 375 was added by this PR (commit `bccfc5d`); the pre-existing code just returned the timeout error without killing. On Windows, if the service exits after the initial `Interrupt` signal but before `Running()` detects it (e.g., PID file persists because the exit handler didn't run), and Windows recycles that PID to an unrelated process within the remaining timeout window, `process.Kill` at line 375 calls `TerminateProcess` on the wrong process. The `Running()` check at line 378 also uses `PID()` which reads the PID file — on Windows, `os.FindProcess` succeeds for any live PID, so a recycled PID looks "still running." (important, regression)

Fix: Re-read the PID before the timeout kill and only proceed if it matches the originally captured value. Better yet, open a process handle at line 340 and use it for both waiting and killing, eliminating the PID-identity ambiguity entirely:

```diff
  case <-timeout:
+     currentPID := PID()
+     if currentPID != pid {
+         return fmt.Errorf("service exited but PID file is stale")
+     }
      _ = process.Kill(pid)
```

2. **`shutdownAllHostagents` silently ignores `store.Inspect` failure** — `pkg/controllers/lima/limavm/controllers/limavm_controller.go:519-523` [Claude Opus 4.6]

```go
cleanupCtx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
inst, err := store.Inspect(cleanupCtx, name)
if err == nil && inst != nil {
    stopInstanceForcibly(cleanupCtx, logger, inst)
}
cancel()
```

When `store.Inspect` fails (instance directory gone, permission error, corrupt YAML), the error is silently discarded — no log, no fallback. The old code (pre-`bccfc5d`) used `state.cmd.Process.Kill()` which reached the process via its OS handle regardless of Lima store state. The new code's `stopInstanceForcibly` is more capable (tree kill, WSL2 termination, tmp cleanup) but requires a successful Inspect. The system self-heals on next start via `killOrphanedHostagent`, so this is not a data-loss risk — but the silent error drop makes debugging shutdown hangs harder. (important, regression)

Fix: Log the error and fall back to direct process kill via the handle already available in `state.cmd`:

```diff
  inst, err := store.Inspect(cleanupCtx, name)
  if err == nil && inst != nil {
      stopInstanceForcibly(cleanupCtx, logger, inst)
+ } else if err != nil {
+     logger.Info("Failed to inspect instance for forceful stop, falling back to direct kill",
+         "instance", name, "error", err)
+     if state.cmd != nil && state.cmd.Process != nil {
+         _ = process.KillTree(cleanupCtx, state.cmd.Process.Pid)
+     }
  }
```

---

## Suggestions

1. **Defensive newline in YAML template concatenation** — `pkg/controllers/app/app/controller.go:32-35` [Gemini 3.1 Pro]

```go
func limaTemplateData() string {
	images := limaImagesUnix
	if runtime.GOOS == "windows" {
		images = limaImagesWSL2
	}
	return images + limaTemplate
}
```

The embedded YAML strings are concatenated directly. If a future editor modifies `lima-images-wsl2.yaml` or `lima-images-unix.yaml` and drops the trailing newline, the resulting YAML will be corrupted (e.g., `mountType: wsl2cpus: 2`). An explicit `"\n"` between them costs nothing and prevents a subtle failure mode. (suggestion, enhancement)

Fix:

```diff
-	return images + limaTemplate
+	return images + "\n" + limaTemplate
```

2. **`process` package has no unit tests** — `pkg/util/process/process_unix.go`, `process_windows.go` [Claude Opus 4.6]

`SetGroup`, `Interrupt`, `Kill`, and `KillTree` are all new functions introduced by this PR. The BATS integration tests exercise them end-to-end, but targeted unit tests would catch regressions faster (e.g., verifying `KillTree` returns nil for a dead process, `SetGroup` preserves existing `SysProcAttr` fields, `taskkillExitNotFound` constant handling). (suggestion, gap)

Fix: Add `process_test.go` with subtests for each function.

3. **`limavm-instance.bats` missing `wslservice.exe` recovery on delete wait** — `bats/tests/33-lima-controllers/limavm-instance.bats:165` [Claude Opus 4.6]

`limavm-running.bats` wraps `rdd ctl wait --for=delete` with `timeout 90` and kills `wslservice.exe` on hang (lines 711-716). `limavm-instance.bats` uses a plain `rdd ctl wait --for=delete ... --timeout=60s` at line 165 without the wslservice recovery pattern. If `wsl.exe --unregister` deadlocks `wslservice.exe` during the instance test's delete, the test suite blocks until the CI job times out. The risk is lower (instance tests don't boot VMs), but the defensive pattern costs nothing to add. (suggestion, gap)

4. **`handleDeletion` timeout derives from reconciler context** — `pkg/controllers/lima/limavm/controllers/limavm_lifecycle.go:79` [Claude Opus 4.6]

```go
deleteCtx, deleteCancel := context.WithTimeout(ctx, 60*time.Second)
```

`ctx` is the reconciler context. If `stopInstanceForcibly` + `terminateWSL2Distro` consumed time before reaching this line, the effective timeout may be less than 60s. Using `context.Background()` as the parent (like `shutdownHostagent`'s `forceStop` closure at line 519) would guarantee the full 60s. The reconciler retries on failure, so the impact is limited. (suggestion, gap)

---

## Design Observations

### Concerns

1. **Process lifecycle on Windows depends on PID files and console/process-group semantics** — The force-stop fallback paths in `shutdownHostagent` (line 521) and `killOrphanedHostagent` (line 587) reuse `inst` snapshots captured before a 30-second graceful shutdown wait, creating a window where PIDs could be recycled on Windows. The `handleWatchedState` deletion path at line 45-71 already has Windows-specific guards for this. Using Job Objects for hostagents would eliminate PID-identity ambiguity entirely, make descendant cleanup deterministic, and remove the `taskkill /T` dead-parent asymmetry. `(future)` [Codex GPT 5.4, Claude Opus 4.6]

### Strengths

- **Process isolation model** is well-designed: `CREATE_NEW_PROCESS_GROUP` on Windows mirrors Unix `Setpgid`, and `process.Interrupt` maps SIGINT/CTRL_BREAK correctly per platform. The `signalHostagent` simplification (removing the `os.Signal` parameter) eliminates flexibility that was never used and wouldn't work cross-platform. [Claude Opus 4.6]

- **`stopInstanceForcibly` consolidation** is a significant improvement. It handles the full cleanup lifecycle (tree kill → disk unlock → WSL2 terminate → tmp file cleanup) with proper state guards. The `filenames.TmpFileSuffixes` usage and `os.ReadDir` over `filepath.Glob` (avoiding bracket meta-character issues) show attention to Windows path edge cases. [Claude Opus 4.6, Gemini 3.1 Pro]

- **SSH key pre-generation** (`ensureSSHKeysAt`) carefully enumerates all partial-state combinations (missing pub, corrupt priv, orphaned pub, stale temps) and the test suite covers each. The generate-then-rename pattern prevents partial keys from blocking future attempts. [Claude Opus 4.6, Codex GPT 5.4, Gemini 3.1 Pro]

- **`shutdownHostagent` consolidation** replaced three separate "signal, wait, force-kill" implementations (`stopInstance`, `handleRestartNeeded`, `killOrphanedHostagent`) with a single function, eliminating duplication and fixing the `shutdownAllHostagents` hang (unbounded `<-state.procExited` wait replaced by 10s timeout). [Claude Opus 4.6]

- **`setVerbosityFromEnv`** bridges the logrus/klog level gap cleanly, and calling it before `flag.Parse()` preserves explicit `-v` precedence. [Claude Opus 4.6]

- **MSYS2 migration** addresses the flaky layer directly instead of adding retries around WSL2 interop, which is the right architectural choice. [Codex GPT 5.4]

---

## Testing Assessment

Test coverage is solid. BATS tests were systematically updated: `skip_on_windows` removed, cross-platform helpers added (`assert_process_alive`, `process_kill`, `editor_cmd`, `vm_template`), and WSL cleanup added to teardown. Untested scenarios ranked by risk:

1. `shutdownHostagent`'s force-stop timeout path (requires a hostagent that ignores shutdown signals) — acknowledged in code comment at line 548-550
2. `process.KillTree` behavior when hostagent is dead but children survive on Windows — documented in `process_windows.go:74-79`
3. Windows `StopWithWait` timeout-kill path with stale PID files — no test coverage
4. Stale PID files on Windows pointing to recycled processes — acknowledged in `limavm_lifecycle.go:48-52`
5. `StopWithWait` when `Interrupt` succeeds but the event isn't delivered (no shared console on Windows) — documented in `service.go:346-351`

---

## Documentation Assessment

Changed code has thorough inline documentation. The platform-specific comments (WSL2 deadlock, PID recycling, CTRL_BREAK behavior) are valuable for future maintainers. `docs/design/api_lima.md` line 175 references "SIGTERM/SIGKILL" for orphan recovery, while the implementation now uses `process.Interrupt` (SIGINT/CTRL_BREAK) and `KillTree` — a minor doc staleness. [Codex GPT 5.4]

---

## Commit Structure

The commit history is clean and logical: each commit represents a self-contained concept (process signaling, WSL2 cleanup, SSH keys, MSYS2 CI, log level propagation, app controller split). The `Address review #N` commits address prior review feedback and are appropriately separate. No fixup commits that should have been squashed.

---

## Acknowledged Limitations

1. **PID recycling on Windows** — `limavm_lifecycle.go:48-52`: "Broken instances may have stale PID files pointing to recycled processes on Windows (Lima's ReadPIDFile treats any live PID as valid)." Also documented at line 215-224. The change makes this more visible by documenting it explicitly. Impact unchanged by this PR; Windows PID recycling was always a risk.

2. **`taskkill /T` dead-parent asymmetry** — `process_windows.go:69-79`: "if the target process is already dead, taskkill /T returns exit code 128 (treated as success), but surviving children... are not killed because taskkill cannot traverse the tree from a dead parent." Acknowledged with "Windows Job Objects would fix this if needed."

3. **Sequential hostagent shutdown** — `limavm_controller.go:504-505`: "TODO: Wait on all hostagents in parallel instead of sequentially; with the current loop, the total wait is N × gracefulShutdownTimeout in the worst case." Pre-existing, unchanged by this PR.

4. **Force-stop timeout path not tested** — `limavm_lifecycle.go:548-550`: "Not tested: the forceStop fallback requires a hostagent that ignores shutdown signals."

5. **Hostagent shutdown mechanism not tested on Windows** — `limavm-running.bats:261-264`: "Not tested: whether the orphaned hostagent shuts down gracefully (via CTRL_BREAK/SIGINT) or is force-killed."

6. **Console limitation for Windows signals** — `service.go:346-351`: "On Windows, Interrupt uses GenerateConsoleCtrlEvent, which fails when caller and target lack a shared console; Kill (TerminateProcess) then bypasses graceful shutdown." Relies on `killOrphanedHostagent` to self-heal on the next launch.

---

## Unresolved Feedback

1. **`--sync` instead of `-S` for `pacman`** — @mook-as suggested using the long-form `--sync` flag ([comment](https://github.com/rancher-sandbox/rancher-desktop-daemon/pull/242#discussion_r2976767538)). The author reacted with +1 but no subsequent commit addresses this.

---

## Agent Performance Retro

### [Claude]

Claude Opus 4.6 delivered a thorough review in 10m4s (141 lines). Found the `shutdownAllHostagents` silent error drop (unique), identified the missing `wslservice.exe` recovery in `limavm-instance.bats` (unique), and noted the `handleDeletion` context timeout issue (unique). Provided strong design observations with detailed strengths. Did not identify the `StopWithWait` PID recycling issue that Codex found. No false positives. Good use of `git blame` to verify regression vs. pre-existing code.

### [Codex]

Codex GPT 5.4 completed in 7m39s (103 lines). Found the `StopWithWait` PID recycling regression (unique) — the strongest finding of the review. Also identified the stale PID snapshot gap in force-stop fallbacks and flagged stale design docs (unique). Focused and precise — every finding was verified and actionable. Did not notice the `shutdownAllHostagents` silent error drop. No false positives.

### [Gemini]

Gemini 3.1 Pro was the fastest at 5m46s (98 lines) and delivered a clean, complete review with full coverage. Found the YAML concatenation newline safety issue (unique). However, it found only 1 suggestion-level finding in a 26-file, 925-insertion PR — likely under-flagging. Marked every substantive file as "Reviewed, no issues" including files where other agents found real issues (e.g., `limavm_controller.go`, `service.go`). Excellent acknowledged-limitations section. No false positives.

### Summary

| Metric | Claude Opus 4.6 | Codex GPT 5.4 | Gemini 3.1 Pro |
|---|---|---|---|
| Duration | 10:04 | 7:39 | 5:46 |
| Critical | 0 | 0 | 0 |
| Important | 1 | 2 | 0 |
| Suggestion | 3 | 0 | 1 |
| Design observations | 5 | 3 | 3 |
| False positives | 0 | 0 | 0 |
| Unique insights | 3 | 2 | 1 |
| Files reviewed | 26 | 26 | 26 |
| Coverage misses | 1 (service.go) | 1 (limavm_controller.go) | 2 (service.go, limavm_controller.go) |

Codex provided the highest-value finding (StopWithWait PID recycling), Claude provided the broadest coverage with the most unique insights, and Gemini provided fast confirmation with good documentation of acknowledged limitations but under-flagged substantive issues. The multi-agent approach was essential: no single agent found all the important issues.

---

## Appendix: Original Reviews

### Claude Opus 4.6

Now let me write the final review.

---

## Executive Summary

This PR moves Windows BATS integration tests from WSL2 to MSYS2, fixing flaky `binfmt_misc` interop failures, and implements Windows-specific process isolation so hostagent management no longer kills the RDD service. The core changes — `process.SetGroup`/`Interrupt`/`KillTree` on Windows, `stopInstanceForcibly`, SSH key pre-generation — are well-designed and address real platform asymmetries. The code is thorough about crash-recovery state enumeration. **Verdict: Approve with minor suggestions.**

## Findings

### Critical Issues

None.

### Important Issues

1. **`shutdownAllHostagents` silently ignores `store.Inspect` failure** — `limavm_controller.go:519-523` (important, regression)

```go
cleanupCtx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
inst, err := store.Inspect(cleanupCtx, name)
if err == nil && inst != nil {
    stopInstanceForcibly(cleanupCtx, logger, inst)
}
cancel()
```

When `store.Inspect` fails (e.g., instance directory gone, permission error), the error is silently discarded — no log, no fallback kill. The old code (commit `776b8f77`) used `state.cmd.Process.Kill()` which always reached the process via its OS handle. The new code's `stopInstanceForcibly` is more capable (tree kill, WSL2 termination, tmp cleanup) but requires a successful Inspect. In practice, `exec.CommandContext` already killed the hostagent leader when the manager context was cancelled, so only child-process cleanup is lost. The system self-heals on next start via `killOrphanedHostagent`. Still, the silent error drop makes debugging harder.

Fix: Log the Inspect error and, as a belt-and-suspenders fallback, kill the process tree via the handle we already hold:

```diff
 inst, err := store.Inspect(cleanupCtx, name)
 if err == nil && inst != nil {
     stopInstanceForcibly(cleanupCtx, logger, inst)
+} else if err != nil {
+    logger.Info("Failed to inspect instance for forceful stop, falling back to direct kill",
+        "instance", name, "error", err)
+    if state.cmd != nil && state.cmd.Process != nil {
+        _ = process.KillTree(cleanupCtx, state.cmd.Process.Pid)
+    }
 }
```

### Suggestions

1. **`process` package has no unit tests** — `pkg/util/process/process_unix.go`, `process_windows.go` (suggestion, gap)

`SetGroup`, `Interrupt`, `Kill`, and `KillTree` are all new functions introduced by this PR. The BATS integration tests exercise them end-to-end, but targeted unit tests would catch regressions faster (e.g., verifying `KillTree` returns nil for a dead process, `SetGroup` preserves existing `SysProcAttr` fields). The Unix variants are straightforward to test; the Windows variants could at least verify the `taskkillExitNotFound` constant handling.

Fix: Add `process_test.go` with subtests for each function. Even basic "start a process, kill it, verify it's dead" tests would be valuable.

2. **`limavm-instance.bats` missing wslservice.exe recovery on delete wait** — `limavm-instance.bats:165` (suggestion, gap)

`limavm-running.bats` wraps `rdd ctl wait --for=delete` with `timeout 90` and kills `wslservice.exe` on hang (line 711-716). `limavm-instance.bats` uses a plain `rdd ctl wait --for=delete ... --timeout=60s` at line 165. If `wsl.exe --unregister` deadlocks `wslservice.exe` during the instance test's delete, the test suite blocks until the CI job times out. The risk is lower (instance tests don't boot VMs), but the defensive pattern costs nothing to add.

3. **`handleDeletion` delete timeout derives from reconciler context** — `limavm_lifecycle.go:79` (suggestion, gap)

```go
deleteCtx, deleteCancel := context.WithTimeout(ctx, 60*time.Second)
```

`ctx` is the reconciler context. If `stopInstanceForcibly` + `terminateWSL2Distro` consumed significant time before reaching this line, the effective timeout may be much less than 60s. Using `context.Background()` as the parent (like `shutdownHostagent`'s `forceStop` closure) would guarantee the full 60s. The reconciler will retry on failure either way, so the impact is limited.

## Design Observations

### Concerns

None substantive. The architecture is sound.

### Strengths

- **Process isolation model** is well-designed: `CREATE_NEW_PROCESS_GROUP` on Windows mirrors Unix `Setpgid`, and `process.Interrupt` maps SIGINT/CTRL_BREAK correctly per platform. The `signalHostagent` simplification (removing the `os.Signal` parameter, which was always `os.Interrupt`) eliminates a flexibility that was never used and wouldn't work cross-platform.

- **`stopInstanceForcibly`** is a significant improvement over the old `limainstance.StopForcibly`. It handles the full cleanup lifecycle (tree kill → disk unlock → WSL2 terminate → tmp file cleanup) with proper state guards (`allKilled` controls whether file cleanup proceeds). The `filenames.TmpFileSuffixes` usage and `os.ReadDir` over `filepath.Glob` (avoiding bracket meta-character issues) show attention to Windows path edge cases.

- **SSH key pre-generation** (`ensureSSHKeysAt`) carefully enumerates all partial-state combinations (missing pub, corrupt priv, orphaned pub, stale temps) and the test suite covers each. The generate-then-rename pattern prevents partial keys from blocking future attempts.

- **`shutdownHostagent` consolidation** — the old code had three separate implementations of "signal, wait, force-kill" in `stopInstance`, `handleRestartNeeded`, and `killOrphanedHostagent`. Consolidating into `shutdownHostagent` eliminates that duplication and fixes the old `shutdownAllHostagents` hang bug (unbounded `<-state.procExited` wait replaced by 10s timeout).

- **`setVerbosityFromEnv`** bridges the logrus/klog level gap cleanly, and calling it before `flag.Parse()` preserves explicit `-v` precedence.

## Testing Assessment

Test coverage is solid. The BATS tests were systematically updated: `skip_on_windows` removed, cross-platform helpers added (`assert_process_alive`, `process_kill`, `editor_cmd`, `vm_template`), and WSL cleanup added to teardown. Untested scenarios, ranked by risk:

1. `shutdownHostagent`'s force-stop timeout path (requires a hostagent that ignores shutdown signals) — acknowledged in code comment at line 548-550
2. `process.KillTree` behavior when hostagent is dead but children survive on Windows — documented in `process_windows.go:74-79`
3. Stale PID files on Windows pointing to recycled processes — acknowledged in `limavm_lifecycle.go:48-52`
4. `StopWithWait` when Interrupt succeeds but the event isn't delivered (no shared console on Windows) — documented in `service.go:346-351`

## Documentation Assessment

Changed code has thorough inline documentation. The platform-specific comments (WSL2 deadlock, PID recycling, CTRL_BREAK behavior) are particularly valuable for future maintainers. No external docs need updating.

## Commit Structure

The commit history is clean and logical: each commit represents a self-contained concept (process signaling, WSL2 cleanup, SSH keys, MSYS2 CI, log level propagation, app controller split). The `Address review #N` commits address prior review feedback and are appropriately separate. The final `Split app controller Lima template by platform` commit is cleanly isolated. No fixup commits that should have been squashed.

## Acknowledged Limitations

1. **PID recycling on Windows** — `limavm_lifecycle.go:48-52`: "Broken instances may have stale PID files pointing to recycled processes on Windows (Lima's ReadPIDFile treats any live PID as valid)." Also at `limavm_lifecycle.go:215-224`. The change makes this more visible by documenting it explicitly rather than relying on Lima's behavior. Impact: unchanged by this PR; Windows PID recycling was always a risk.

2. **taskkill /T dead-parent asymmetry** — `process_windows.go:74-79`: "if the target process is already dead, taskkill /T returns exit code 128 (treated as success), but surviving children... are not killed." Acknowledged with "Windows Job Objects would fix this if needed."

3. **Sequential hostagent shutdown** — `limavm_controller.go:504-505`: "TODO: Wait on all hostagents in parallel instead of sequentially." Pre-existing, unchanged by this PR.

4. **Force-stop timeout path not tested** — `limavm_lifecycle.go:548-550`: "Not tested: the forceStop fallback requires a hostagent that ignores shutdown signals."

5. **Hostagent shutdown mechanism not tested on Windows** — `limavm-running.bats:261-264`: "Not tested: whether the orphaned hostagent shuts down gracefully (via CTRL_BREAK/SIGINT) or is force-killed."

## Coverage Summary

| File | Status |
|------|--------|
| `.github/actions/spelling/expect/tech.txt` | Trivial |
| `.github/actions/spelling/expect/tools.txt` | Trivial |
| `.github/workflows/bats.yaml` | Reviewed, no issues |
| `bats/helpers/load.bash` | Trivial |
| `bats/helpers/os.bash` | Reviewed, no issues |
| `bats/helpers/utils.bash` | Reviewed, no issues |
| `bats/helpers/vm_template.bash` | Reviewed, no issues |
| `bats/tests/32-app-controllers/app.bats` | Reviewed, no issues |
| `bats/tests/33-lima-controllers/limavm-instance.bats` | Finding S2 |
| `bats/tests/33-lima-controllers/limavm-running.bats` | Reviewed, no issues |
| `pkg/controllers/app/app/controller.go` | Reviewed, no issues |
| `pkg/controllers/app/app/lima-images-unix.yaml` | Reviewed, no issues |
| `pkg/controllers/app/app/lima-images-wsl2.yaml` | Reviewed, no issues |
| `pkg/controllers/app/app/lima-template.yaml` | Reviewed, no issues |
| `pkg/controllers/lima/limavm/controllers/hostagent_watcher.go` | Reviewed, no issues |
| `pkg/controllers/lima/limavm/controllers/limavm_controller.go` | Finding 1 |
| `pkg/controllers/lima/limavm/controllers/limavm_lifecycle.go` | Finding S3 |
| `pkg/controllers/lima/limavm/controllers/ssh_keys.go` | Reviewed, no issues |
| `pkg/controllers/lima/limavm/controllers/ssh_keys_test.go` | Reviewed, no issues |
| `pkg/controllers/lima/limavm/controllers/ssh_keys_unix.go` | Reviewed, no issues |
| `pkg/controllers/lima/limavm/controllers/ssh_keys_windows.go` | Reviewed, no issues |
| `pkg/external/main.go` | Reviewed, no issues |
| `pkg/service/cmd/service.go` | Reviewed, no issues |
| `pkg/util/process/process_unix.go` | Finding S1 |
| `pkg/util/process/process_windows.go` | Finding S1 |
| `scripts/collect-bats-logs.sh` | Reviewed, no issues |


### Codex GPT 5.4

**Executive Summary**

This PR replaces Windows WSL-hosted BATS execution with MSYS2, adds Windows-specific Lima/App templates and SSH-key bootstrapping, and rewrites several Lima shutdown paths to avoid console-wide `CTRL_BREAK` behavior. The overall direction is good, but I found two Windows PID-identity problems: one new regression in `service.StopWithWait()` and one remaining gap in the Lima force-stop fallbacks that still reuse stale PID snapshots after a long grace wait.

**Critical Issues**

None.

**Important Issues**

1. **Timeout fallback can kill a recycled PID on Windows** — [pkg/service/cmd/service.go:368](/Users/jan/git/rancher-desktop-daemon/pkg/service/cmd/service.go#L368) (important, regression)

```go
case <-timeout:
    // Graceful shutdown timed out; terminate so we don't leave
    // a hung service process.
    _ = process.Kill(pid)
    return fmt.Errorf("timed out waiting for %q control plane with pid %d to stop", instance.Name(), pid)
case <-ticker.C:
    if !Running() {
```

`StopWithWait()` captures `pid := PID()` at line 340, but its liveness checks go back through `PID()` at [pkg/service/cmd/service.go:133](/Users/jan/git/rancher-desktop-daemon/pkg/service/cmd/service.go#L133)-155. On Windows that helper treats any live process with the same numeric PID as "still running" because `os.FindProcess()` succeeding is enough there (lines 141-148). If the service exits after the initial graceful signal, leaves its PID file behind, and Windows recycles that PID before the 60-second timeout, line 375 will call `TerminateProcess` on an unrelated process. The new timeout kill at line 375 is what turns the pre-existing stale-PID ambiguity into an actual third-party kill.

Fix: before the timeout kill, re-read `instance.PIDFile()` and only kill if it still contains the same PID from line 340; better yet, remove the PID file during clean shutdown and/or keep a process handle / generation token so stop logic never has to trust a bare numeric PID on Windows.

2. **The Windows force-stop fallbacks still reuse stale PID snapshots after waiting 30s** — [pkg/controllers/lima/limavm/controllers/limavm_lifecycle.go:521](/Users/jan/git/rancher-desktop-daemon/pkg/controllers/lima/limavm/controllers/limavm_lifecycle.go#L521) (important, gap)

```go
forceInst := inst
if forceInst == nil {
    forceInst, err = store.Inspect(forceCtx, name)
    ...
}
stopInstanceForcibly(forceCtx, logger, forceInst)
...
if inst.HostAgentPID > 0 {
    if err := process.Interrupt(inst.HostAgentPID); err != nil {
```

`shutdownHostagent()` reuses the `inst` snapshot captured before the 30-second graceful wait at lines 521-533 and 545-554, and `killOrphanedHostagent()` does the same after signaling and waiting at lines 573-587. On Windows, Lima's `ReadPIDFile()` treats any live recycled PID as valid ([store/instance.go:193](/Users/jan/go/pkg/mod/github.com/lima-vm/lima/v2@v2.0.3/pkg/store/instance.go#L193)-211), so once that grace period elapses the cached `HostAgentPID` / `DriverPID` may already belong to a different process by the time `stopInstanceForcibly()` runs. The deletion path in the same file already added a Windows-specific guard at [pkg/controllers/lima/limavm/controllers/limavm_lifecycle.go:45](/Users/jan/git/rancher-desktop-daemon/pkg/controllers/lima/limavm/controllers/limavm_lifecycle.go#L45)-71; these fallback paths are now the remaining unsafe instances.

Fix: mirror the deletion logic here. Immediately before any forced kill, re-run `store.Inspect()`, drop PID-based kills for `StatusBroken`, and refuse to kill if the current PID files no longer match the snapshot you signaled earlier. Longer term, Job Objects or retained process handles would remove this class of bug entirely.

**Suggestions**

None.

**Design Observations**

**Concerns**

1. Process lifecycle on Windows still depends on pidfiles plus console/process-group semantics. Using Job Objects for hostagents and the service subprocess tree would eliminate recycled-PID ambiguity, make descendant cleanup deterministic, and remove the `taskkill /T` dead-parent asymmetry. `(future)`

**Strengths**

1. Switching Windows BATS from WSL interop to MSYS2 addresses the flaky layer directly instead of adding more retries around it.
2. The SSH-key pre-generation code is well-contained and handles crash-left temp/orphan states carefully, including Windows `os.Rename` semantics.

**Testing Assessment**

1. No test exercises the Windows timeout fallback in `shutdownHostagent()` / `killOrphanedHostagent()` where the forced path runs after a long grace wait with potentially stale PID files.
2. No Windows test covers the new `service.StopWithWait()` timeout-kill path against stale PID files / PID reuse.
3. The new code compiles and the targeted Go tests passed: `go test ./pkg/controllers/lima/limavm/controllers ./pkg/util/process ./pkg/external ./pkg/service/cmd ./pkg/controllers/app/app/...`.

**Documentation Assessment**

`docs/design/api_lima.md` is now stale: the "Normal Stop" and "Orphan Recovery" sections still describe `SIGTERM`/`SIGKILL`, while the implementation now uses `process.Interrupt()` (`SIGINT` on Unix, `CTRL_BREAK_EVENT` on Windows) plus `KillTree()`.

**Acknowledged Limitations**

1. [pkg/controllers/lima/limavm/controllers/limavm_lifecycle.go:216](/Users/jan/git/rancher-desktop-daemon/pkg/controllers/lima/limavm/controllers/limavm_lifecycle.go#L216): "On Windows, StatusBroken instances may have stale PID files whose PIDs were recycled to unrelated processes." The change documents this risk in `handleWatchedState`; it still matters elsewhere, as noted above.
2. [pkg/util/process/process_windows.go:69](/Users/jan/git/rancher-desktop-daemon/pkg/util/process/process_windows.go#L69): "`taskkill /T` returns exit code 128 ... surviving children ... are not killed because taskkill cannot traverse the tree from a dead parent." The comment correctly captures the remaining dead-parent limitation; Job Objects would be the clean fix.
3. [pkg/controllers/lima/limavm/controllers/limavm_controller.go:500](/Users/jan/git/rancher-desktop-daemon/pkg/controllers/lima/limavm/controllers/limavm_controller.go#L500): "TODO: Wait on all hostagents in parallel..." This affects shutdown latency with many VMs, not correctness.

**Coverage Summary**

- `.github/actions/spelling/expect/tech.txt` — Trivial
- `.github/actions/spelling/expect/tools.txt` — Trivial
- `.github/workflows/bats.yaml` — Reviewed, no issues
- `bats/helpers/load.bash` — Reviewed, no issues
- `bats/helpers/os.bash` — Reviewed, no issues
- `bats/helpers/utils.bash` — Reviewed, no issues
- `bats/helpers/vm_template.bash` — Reviewed, no issues
- `bats/tests/32-app-controllers/app.bats` — Reviewed, no issues
- `bats/tests/33-lima-controllers/limavm-instance.bats` — Reviewed, no issues
- `bats/tests/33-lima-controllers/limavm-running.bats` — Reviewed, no issues
- `pkg/controllers/app/app/controller.go` — Reviewed, no issues
- `pkg/controllers/app/app/lima-images-unix.yaml` — Reviewed, no issues
- `pkg/controllers/app/app/lima-images-wsl2.yaml` — Reviewed, no issues
- `pkg/controllers/app/app/lima-template.yaml` — Reviewed, no issues
- `pkg/controllers/lima/limavm/controllers/hostagent_watcher.go` — Reviewed, no issues
- `pkg/controllers/lima/limavm/controllers/limavm_controller.go` — Reviewed, no issues
- `pkg/controllers/lima/limavm/controllers/limavm_lifecycle.go` — Finding 2
- `pkg/controllers/lima/limavm/controllers/ssh_keys.go` — Reviewed, no issues
- `pkg/controllers/lima/limavm/controllers/ssh_keys_test.go` — Reviewed, no issues
- `pkg/controllers/lima/limavm/controllers/ssh_keys_unix.go` — Reviewed, no issues
- `pkg/controllers/lima/limavm/controllers/ssh_keys_windows.go` — Reviewed, no issues
- `pkg/external/main.go` — Reviewed, no issues
- `pkg/service/cmd/service.go` — Finding 1
- `pkg/util/process/process_unix.go` — Reviewed, no issues
- `pkg/util/process/process_windows.go` — Reviewed, no issues
- `scripts/collect-bats-logs.sh` — Reviewed, no issues


### Gemini 3.1 Pro

### Executive Summary

This PR enables Lima BATS integration tests on Windows by migrating the test runner from WSL2 to MSYS2, completely bypassing WSL2 interop flakiness. It overhauls process management across the daemon, ensuring hostagents are grouped securely (`CREATE_NEW_PROCESS_GROUP`), graceful shutdown signals use group-targeted `CTRL_BREAK_EVENT` on Windows, and zombie WSL2 distros and leftover PID files are forcefully cleaned up to prevent `wslservice.exe` deadlocks. The changes are exceptionally well-crafted, explicitly handling platform asymmetries and edge cases safely.

### Critical Issues

None.

### Important Issues

None.

### Suggestions

1. **Defensive string concatenation for YAML templates** — `pkg/controllers/app/app/controller.go:32-35` (suggestion, enhancement)

```go
func limaTemplateData() string {
	images := limaImagesUnix
	if runtime.GOOS == "windows" {
		images = limaImagesWSL2
	}
	return images + limaTemplate
}
```

The embedded YAML strings are concatenated directly. If a future editor modifies `lima-images-wsl2.yaml` or `lima-images-unix.yaml` and accidentally drops the trailing newline, the resulting YAML will be corrupted (e.g., `mountType: wsl2cpus: 2`).

Fix: Add an explicit newline during concatenation. YAML parsers safely ignore redundant empty lines.

```diff
 func limaTemplateData() string {
 	images := limaImagesUnix
 	if runtime.GOOS == "windows" {
 		images = limaImagesWSL2
 	}
-	return images + limaTemplate
+	return images + "\n" + limaTemplate
 }
```

### Design Observations

**Strengths**
- **Robust Process Fallbacks:** The use of `unix.Kill(-pid, unix.SIGKILL)` with an immediate fallback to `unix.Kill(pid, unix.SIGKILL)` gracefully handles the edge case where a process didn't inherit its expected process group, ensuring nothing escapes the kill tree.
- **Resilient SSH Key Generation:** `ensureSSHKeysAt` handles every possible corrupted disk state seamlessly — missing private keys, missing public keys, stale `.tmp` files, and `os.Rename` collision constraints on Windows.
- **Graceful Timeout Escapes:** Constructing derived context timeouts (e.g., `terminateWSL2Distro`) from an already-bounded `cleanupCtx` correctly isolates individual step limits without breaching the overarching cleanup deadline.

### Testing Assessment

The test coverage is comprehensive and thoughtfully simulates complex lifecycle failures. A few nuanced scenarios remain unisolated, as noted by the author:

1. **Hostagent Signal Resistance (Medium Risk):** The `stopInstanceForcibly` timeout fallback path (when a hostagent stubbornly ignores graceful signals) is exercised indirectly during orphaned tests but lacks a dedicated scenario isolating the timeout behavior.
2. **Orphan Shutdown Mechanism (Low Risk):** The orphaned hostagent integration test successfully verifies that recovery occurs, but does not assert whether the orphan exited gracefully or required a forced kill.

### Documentation Assessment

The code is exceptionally well documented. All public APIs, platform asymmetries, and non-obvious workarounds feature thorough Godoc explanations explaining *why* constraints exist (e.g. `wslservice.exe` deadlocks, PID recycling risks, `taskkill` exit codes).

### Commit Structure

The commits are atomic, strictly sequenced, and provide detailed motivations in their messages that perfectly map to the codebase changes.

### Acknowledged Limitations

- **Stale PID files and process recycling on Windows:** "On Windows, StatusBroken instances may have stale PID files whose PIDs were recycled to unrelated processes. stopInstanceForcibly uses taskkill, which kills by PID without verifying the process identity." (`pkg/controllers/lima/limavm/controllers/limavm_lifecycle.go:216`). Mitigated during `Delete` by clearing PIDs, but remains a known risk during self-healing restarts.
- **Taskkill limitations for dead parents:** "Platform asymmetry: if the target process is already dead, taskkill /T returns exit code 128 (treated as success), but surviving children (e.g., SSH port forwarders) are not killed because taskkill cannot traverse the tree from a dead parent." (`pkg/util/process/process_windows.go:73`). Acknowledged as acceptable since orphaned port forwarders cannot rebind and are harmless.
- **Sequential hostagent waits:** "Wait on all hostagents in parallel instead of sequentially; with the current loop, the total wait is N × gracefulShutdownTimeout in the worst case." (`pkg/controllers/lima/limavm/controllers/limavm_controller.go:504`).
- **Console limitations for Windows signals:** "On Windows, Interrupt uses GenerateConsoleCtrlEvent, which fails when caller and target lack a shared console; Kill (TerminateProcess) then bypasses graceful shutdown." (`pkg/service/cmd/service.go:347`). The limitation correctly relies on `killOrphanedHostagent` to self-heal on the next launch.

### Coverage Summary

- `.github/actions/spelling/expect/tech.txt` — Trivial
- `.github/actions/spelling/expect/tools.txt` — Trivial
- `.github/workflows/bats.yaml` — Reviewed, no issues
- `bats/helpers/load.bash` — Trivial
- `bats/helpers/os.bash` — Reviewed, no issues
- `bats/helpers/utils.bash` — Reviewed, no issues
- `bats/helpers/vm_template.bash` — Reviewed, no issues
- `bats/tests/32-app-controllers/app.bats` — Reviewed, no issues
- `bats/tests/33-lima-controllers/limavm-instance.bats` — Reviewed, no issues
- `bats/tests/33-lima-controllers/limavm-running.bats` — Reviewed, no issues
- `pkg/controllers/app/app/controller.go` — **Finding 1**
- `pkg/controllers/app/app/lima-images-unix.yaml` — Reviewed, no issues
- `pkg/controllers/app/app/lima-images-wsl2.yaml` — Reviewed, no issues
- `pkg/controllers/app/app/lima-template.yaml` — Reviewed, no issues
- `pkg/controllers/lima/limavm/controllers/hostagent_watcher.go` — Reviewed, no issues
- `pkg/controllers/lima/limavm/controllers/limavm_controller.go` — Reviewed, no issues
- `pkg/controllers/lima/limavm/controllers/limavm_lifecycle.go` — Reviewed, no issues
- `pkg/controllers/lima/limavm/controllers/ssh_keys.go` — Reviewed, no issues
- `pkg/controllers/lima/limavm/controllers/ssh_keys_test.go` — Reviewed, no issues
- `pkg/controllers/lima/limavm/controllers/ssh_keys_unix.go` — Reviewed, no issues
- `pkg/controllers/lima/limavm/controllers/ssh_keys_windows.go` — Reviewed, no issues
- `pkg/external/main.go` — Reviewed, no issues
- `pkg/service/cmd/service.go` — Reviewed, no issues
- `pkg/util/process/process_unix.go` — Reviewed, no issues
- `pkg/util/process/process_windows.go` — Reviewed, no issues
- `scripts/collect-bats-logs.sh` — Reviewed, no issues

