# Deep Review: 20260318-173320-msys2

| | |
|---|---|
| **Date** | 2026-03-18 17:33 |
| **Branch** | `msys2` |
| **Commits** | `2e14c80` Fix Windows process signaling for Lima hostagent lifecycle<br>`1104a76` Fix stopInstanceForcibly to terminate WSL2 distro and clean up files<br>`2d01084` Enable Lima BATS tests on Windows |
| **Reviewers** | Claude Opus 4.6, Codex GPT 5.4, Gemini 3.1 Pro |
| **Verdict** | **Merge as-is** -- remaining items deferred to upstream Lima fixes |
| **Wall-clock time** | 14 min 56 s |
| **Updated** | 2026-03-18 -- findings fixed post-review; struck-through items are resolved. Agent timings and retro reflect the original review, not the updated state. |

---

## Consolidated Review

### Executive Summary

This branch enables Lima BATS tests on Windows (MSYS2/WSL2) by replacing Lima's upstream `StopForcibly` -- whose `GenerateConsoleCtrlEvent(CTRL_BREAK)` kills the entire console group, including the RDD service -- with a custom `stopInstanceForcibly` that uses per-process `TerminateProcess`. The BATS tests gain cross-platform process helpers, WSL2-specific templates, and distro cleanup in teardown. The implementation solves the core Windows problem. The principal regression is that `killOrphanedHostagent` now force-kills on all platforms, removing the graceful shutdown path that Unix previously had.

**Structure:** 3 commits, 9 files, +220/-79 lines. Go changes in `limavm_lifecycle.go` (shutdown paths), `hostagent_watcher.go` (signalHostagent refactor), and `process_{unix,windows}.go` (new `Interrupt`/`Kill` helpers). BATS changes in `os.bash` (cross-platform process helpers), `utils.bash`, `limavm-instance.bats`, and `limavm-running.bats`.

### Critical Issues

None.

### Important Issues

1. **Orphaned hostagents force-killed without graceful shutdown on Unix** -- `limavm_lifecycle.go:507-509` [Gemini 3.1 Pro] *(deferred to upstream)*

The old code called `limainstance.StopGracefully` with a 30-second timeout, then fell back to `StopForcibly`. The new code calls `stopInstanceForcibly` directly, force-killing the hostagent and driver on all platforms. On Unix, this removes the guest's opportunity to sync filesystems and shut down cleanly. The change is necessary on Windows (where `StopGracefully` calls `SysKill`, which crashes the RDD service), but the graceful path should be preserved on Unix. *(important, regression)*

Deferred: will be resolved by fixing Lima's `SysKill` upstream and switching back to `limainstance.StopGracefully`.

~~2. **Signal failure leaves VM driver alive** -- FIXED: both `stopInstance` and `handleRestartNeeded` now call `stopInstanceForcibly` after `killHostagent`.~~

~~3. **Unbounded `wsl.exe` execution blocks the reconciler** -- FIXED: `stopInstanceForcibly` now uses `exec.CommandContext` with a 30-second timeout.~~

### Suggestions

1. **Missing `AdditionalDisks` unlock in `stopInstanceForcibly`** [Codex GPT 5.4] *(deferred to upstream)*

Lima's upstream `StopForcibly` unlocks `inst.AdditionalDisks` before killing processes. The custom `stopInstanceForcibly` omits this. RDD does not use `AdditionalDisks`, so the omission has no practical impact. Will be resolved when switching back to upstream `StopForcibly`. *(suggestion, gap)*

~~2. **Stale comment claims `os.Interrupt` is unsupported on Windows** -- FIXED: updated to "process already exited or no console".~~

~~3. **`editor_cmd` guards `cygpath` with `is_windows` instead of `is_msys`** -- FIXED: changed to `is_msys`.~~

~~4. **`shutdownAllHostagents` doc comment uses Unix-specific terms** -- FIXED: updated to platform-neutral terms.~~

2. **`filepath.Glob` less robust than Lima's `ReadDir` approach** -- `limavm_lifecycle.go:537-541` [Gemini 3.1 Pro] *(deferred to upstream)*

`filepath.Glob` fails silently if `inst.Dir` contains glob metacharacters. Lima's `StopForcibly` uses `os.ReadDir` + `strings.HasSuffix`. Low risk since instance paths derive from Kubernetes resource names. Will be resolved when switching back to upstream `StopForcibly`. *(suggestion, gap)*

3. **PID recycling risk with `os.FindProcess`** -- `limavm_lifecycle.go:524` [Gemini 3.1 Pro]

On Windows, `os.FindProcess(pid)` always succeeds even if the original process has terminated. If the OS recycled the PID, `p.Kill()` terminates an unrelated process. Inherent to PID-based approaches; existed in Lima's original `SysKill`. *(suggestion, gap)*

---

## Design Observations

1. **Platform-specific force-stop strategy** *(in-scope)* [Codex GPT 5.4, Gemini 3.1 Pro]

The Windows workaround replaces Lima's stop helpers on all platforms. A cleaner design isolates the Windows-specific `TerminateProcess` logic behind a platform build tag while preserving Lima's upstream behavior on Unix. This eliminates a class of drift bugs (like the missing disk unlock) and keeps RDD aligned with Lima's evolving cleanup logic on platforms where Lima works correctly. The branch already demonstrates this pattern with `process_unix.go` / `process_windows.go`.

2. **Console-less Windows service deployment** *(future)* [Claude Opus 4.6]

`process.Interrupt` uses `GenerateConsoleCtrlEvent`, which requires the target process to share a console with the caller. When RDD runs as a headless Windows service, `GenerateConsoleCtrlEvent` may fail, making the `killHostagent` fallback the default path. Consider an alternative graceful shutdown mechanism for headless operation (named event, socket-based command) as Windows deployment matures.

3. **`shutdownAllHostagents` timeout path lacks WSL2 cleanup** *(future)* [Claude Opus 4.6]

~~Signaling: `shutdownAllHostagents` used `syscall.SIGINT` -- FIXED: now uses `process.Interrupt`.~~

The timeout path (limavm_controller.go:504-506) calls `state.cmd.Process.Kill()`, which terminates only the hostagent; on WSL2, the distro survives. Pre-existing but relevant as this branch enables Windows support.

---

## Testing Assessment

The BATS test adaptation is thorough. The changes replace bare `kill` calls with cross-platform `assert_process_alive` and `process_kill` helpers, add WSL2-specific templates with tarball images, and clean up WSL2 distros in `local_teardown_file`. The Windows process table visibility workaround (skipping PID death check after graceful shutdown on Windows) is well-documented.

Untested scenarios, ranked by risk:

1. **Orphaned hostagent graceful shutdown on Unix**: The orphaned hostagent test verifies the old PID dies and a new one starts, but does not verify whether shutdown was graceful or forced. This masks the regression in finding #1.
2. **Headless Windows service mode**: All BATS tests run from an MSYS2 shell with a console. No test validates behavior without a console window.

---

## Documentation Assessment

The `stopInstanceForcibly` function comment explains the motivation clearly. ~~Stale comments in findings #2 and #4~~ -- FIXED. If the force-stop logic remains custom, it should document which Lima upstream cleanup steps are intentionally omitted and why (e.g., `AdditionalDisks` unlock).

---

## Agent Performance Retro

### Claude Opus 4.6

Claude delivered the most nuanced analysis. It correctly identified the signal-failure fallback gap and traced the full self-healing path through `handleUnwatchedState`, arriving at an accurate IMPORTANT rating. It caught the stale comment, the `cygpath` guard issue, and the `shutdownAllHostagents` doc comment -- all valid, specific findings. It also raised the console-less deployment design concern. Claude missed the orphaned hostagent graceful-shutdown regression that Gemini caught, and it did not examine Lima's upstream `StopForcibly` for omitted cleanup (Codex's finding).

### Codex GPT 5.4

Codex focused on a single finding: the missing `AdditionalDisks` unlock. It traced into Lima's upstream `StopForcibly` source code and the `vz` driver to identify the disk-locking regression path. The analysis was thorough but the finding has zero practical impact because RDD does not use `AdditionalDisks`. Codex produced no other findings, missing the signal-failure fallback, the orphaned hostagent regression, and the `wsl.exe` timeout gap.

### Gemini 3.1 Pro

Gemini raised the most findings (2 critical, 2 important, 1 suggestion) and caught the orphaned hostagent graceful-shutdown regression that neither Claude nor Codex identified. It also caught the unbounded `wsl.exe` call, a genuine gap. However, Gemini over-rated severity: both CRITICAL findings are IMPORTANT in practice. For finding #1, Gemini acknowledged the Windows constraint but rated severity as if the regression applied equally across platforms. For finding #2, Gemini claimed the system "silently" enters a broken state without tracing the self-healing reconcile loop. The `filepath.Glob` metacharacter concern is technically valid but low-risk.

### Summary Table

| Metric | Claude Opus 4.6 | Codex GPT 5.4 | Gemini 3.1 Pro |
|---|---|---|---|
| Duration | 10:14 | 4:14 | 5:34 |
| Critical | 0 | 0 | 2 |
| Important | 1 | 1 | 2 |
| Suggestion | 3 | 0 | 1 |
| Design observations | 2 | 1 | 1 |
| False positives | 0 | 0 | 0 |
| Unique insights | 3 | 1 | 2 |

All three agents contributed unique findings. No agent produced outright false positives, though Codex's finding has zero practical impact and Gemini over-rated severity. Claude provided the most accurate analysis overall. Gemini caught the most important regression (orphaned hostagent). Codex provided the least value this round -- one finding with no real-world impact, in the fastest time.

---

## Appendix: Original Reviews

### Claude Opus 4.6

## Executive Summary

This branch adds Windows (MSYS2/WSL2) support for Lima's hostagent lifecycle management and BATS integration tests. The core change replaces Lima's upstream `StopForcibly` (which uses `GenerateConsoleCtrlEvent(CTRL_BREAK)` targeting the entire console group, killing the RDD service alongside the hostagent) with a custom `stopInstanceForcibly` that uses per-process `TerminateProcess`. The BATS tests are well-adapted with cross-platform process helpers and WSL2-specific templates. The implementation is solid; the key issue is incomplete cleanup in the signal-failure fallback path.

## Findings

### Critical Issues

None.

### Important Issues

1. **`killHostagent` fallback skips WSL2 distro termination and tmp file cleanup** -- `limavm_lifecycle.go:466-472` and `limavm_lifecycle.go:296-299` *(important, gap)*

```go
} else {
    // Signal delivery failed (e.g. os.Interrupt is not supported on Windows).
    // Kill the hostagent process directly via its process handle rather than
    // via StopForcibly (which kills by PID and cleans up files concurrently).
    logger.Info("Could not signal hostagent, killing process directly")
    r.killHostagent(limaVM.Name)
    r.waitForProcessExit(ctx, limaVM.Name)
}
```

When `signalHostagent` fails, both `stopInstance` (line 471) and `handleRestartNeeded` (line 298) call `r.killHostagent()`, which kills only the hostagent process via `state.cmd.Process.Kill()`. On WSL2, this leaves the distro running (the `nohup sleep` keepalive survives) and doesn't clean up `.pid`/`.sock`/`.tmp` files.

In `stopInstance`, the subsequent `store.Inspect` (line 478) reports `StatusRunning` (because the WSL2 distro is still alive), causing a `StopFailed` error at line 488. In `handleRestartNeeded`, the status is patched to "Stopped" without verifying -- the next reconcile finds a running orphan via `handleUnwatchedState` -> `killOrphanedHostagent` -> `stopInstanceForcibly`, which properly terminates the distro. So the issue is self-healing, but at the cost of noisy error logs and 2-3 extra reconcile cycles.

This matters beyond edge cases: when RDD runs as a headless Windows service (no console), `GenerateConsoleCtrlEvent` may fail regularly, making the `else` branch the default path.

Fix: After `killHostagent`, call a minimal WSL2 cleanup function -- terminate the distro and clean up tmp files. This could be extracted from `stopInstanceForcibly` since the PID-based kills are redundant (the process handle `Kill()` already did that). For `handleRestartNeeded`, calling `store.Inspect` + `stopInstanceForcibly` in the `else` branch would mirror the graceful-timeout fallback.

### Suggestions

2. **Stale comment: "os.Interrupt is not supported on Windows"** -- `limavm_lifecycle.go:467` *(suggestion, regression)*

```go
// Signal delivery failed (e.g. os.Interrupt is not supported on Windows).
```

This comment was written in the same commit (`2e14c80`) that implements `process.Interrupt` on Windows via `GenerateConsoleCtrlEvent`. The example scenario is immediately wrong -- `process.Interrupt` *is* supported on Windows. The `else` branch fires for other reasons (process already exited, no console, no watcher).

Fix: Change the example to something accurate, e.g., `"Signal delivery failed (e.g. process already exited or no console available)"`.

3. **`editor_cmd` uses `cygpath` guarded by `is_windows` instead of `is_msys`** -- `limavm-running.bats:112-113` *(suggestion, gap)*

```bash
editor_cmd() {
    local script=$1
    if is_windows; then
        echo "bash $(cygpath -m "${script}")"
    ...
```

`cygpath` is an MSYS2/Cygwin utility that does not exist on WSL2. Since `is_windows` returns true for both MSYS2 and WSL2 (per `os.bash` lines 14-15, 20-21), this would fail on a WSL2 BATS runner. Currently only MSYS2 runners are used, so this is theoretical.

Fix: Use `is_msys` instead of `is_windows`, or add a WSL2 branch that converts paths via `wslpath`.

4. **`shutdownAllHostagents` doc comment is stale** -- `limavm_controller.go:477-478` *(suggestion, gap)*

```go
// It sends SIGINT to each hostagent for graceful shutdown, waits for them to exit,
// and falls back to SIGKILL after a timeout.
```

The implementation now uses `process.Interrupt` (CTRL_BREAK on Windows, not SIGINT) and `state.cmd.Process.Kill()` (TerminateProcess on Windows, not SIGKILL). The comment should use platform-neutral terms or note the Windows behavior.

Fix: `"It sends a graceful shutdown signal to each hostagent, waits for them to exit, and falls back to process termination after a timeout."`

## Design Observations

1. **Console-less Windows service deployment** *(future)*

`process.Interrupt` on Windows uses `GenerateConsoleCtrlEvent`, which requires the target process to share a console with the caller. When RDD runs as a Windows service (no console window), the hostagent may inherit no console, causing `GenerateConsoleCtrlEvent` to fail. This makes the `killHostagent` fallback the default path in production service mode. The current design is correct for MSYS2 testing; as Windows deployment matures, consider an alternative graceful shutdown mechanism for headless operation (e.g., a named event that the hostagent polls, or a socket-based shutdown command).

2. **`shutdownAllHostagents` timeout path also lacks WSL2 cleanup** *(future)*

At `limavm_controller.go:504-506`, when a hostagent doesn't exit within `gracefulShutdownTimeout`, `state.cmd.Process.Kill()` terminates only the hostagent process. On WSL2, the distro would survive, leaking a running distro on service shutdown. This is pre-existing code (not changed by this PR except the signal line at 491) but worth noting as part of the same class of issues as finding #1.

## Testing Assessment

The test coverage is thorough. Both `limavm-instance.bats` and `limavm-running.bats` are properly adapted:

- WSL2-specific templates with tarball images (vs qcow2 for Unix)
- Cross-platform process helpers (`assert_process_alive`, `process_kill`) correctly use `tasklist.exe`/`taskkill.exe` on Windows
- WSL2 distro cleanup in `local_teardown_file`
- Windows-specific workarounds are well-documented (e.g., the process table visibility comment at line 356-358)

Untested scenarios, ranked by risk:

1. **`signalHostagent` failure path on WSL2**: No test exercises the `else` branch where `signalHostagent` returns false while a WSL2 distro is running. This is the path affected by finding #1.
2. **Headless Windows service mode**: All BATS tests run from an MSYS2 shell with a console. No test validates behavior when RDD has no console (the production deployment scenario on Windows).

## Documentation Assessment

The code comments are generally good. The `stopInstanceForcibly` function doc (lines 512-520) clearly explains the motivation for replacing Lima's upstream `StopForcibly`. The stale comments noted in findings #2 and #4 should be updated.

---

### Codex GPT 5.4

**Verdict**

Needs fixes before merge.

**Executive Summary**

These commits replace Lima's default process signaling and force-stop paths so Lima BATS can run on Windows/MSYS2, including WSL2-specific shutdown cleanup and Windows-aware test helpers. The main issue I found is that the new custom force-stop helper now replaces Lima's cleanup on every platform, but it dropped Lima's additional-disk unlock logic, which can break forced restart/orphan-recovery flows for VMs that use `additionalDisks` (especially on `vz`). I did not find a critical data-loss or security issue beyond that regression.

**Critical Issues**

None.

**Important Issues**

1. **Custom force-stop dropped Lima's additional-disk unlock cleanup** -- `pkg/controllers/lima/limavm/controllers/limavm_lifecycle.go:507-543` `(important, regression)`

```go
func (r *LimaVMReconciler) killOrphanedHostagent(_ context.Context, inst *limatype.Instance) error {
	stopInstanceForcibly(inst)
	return nil
}

func stopInstanceForcibly(inst *limatype.Instance) {
	for _, pid := range []int{inst.DriverPID, inst.HostAgentPID} {
		if pid > 0 {
			if p, err := os.FindProcess(pid); err == nil {
				_ = p.Kill()
```

`git blame` shows the new helper at lines 507-543 and its new call sites at lines 188, 292, and 463 were introduced by `2e14c80`/`1104a76`. The problem is that `stopInstanceForcibly()` at lines 521-542 now replaces `limainstance.StopForcibly()` in orphan recovery, restart fallback, and forced stop, but it only kills PIDs and removes tmp files. Lima's original implementation in stop.go:119-128 also unlocks every `inst.AdditionalDisks` entry before killing the hostagent. That unlock is required because the next start path consults disk locks: the `vz` driver rejects any pre-existing lock, so a forced restart/orphan recovery can strand a `vz` VM with "disk in use" even when it is the same instance. `qemu` is more forgiving (unlocks same-instance locks), but `vz` is not, so this is a real cross-platform regression introduced by broadening the Windows workaround into the generic force-stop path.

Fix: keep the Windows-specific termination logic scoped to Windows only, but preserve Lima's cleanup semantics. The safest option is to call `limainstance.StopForcibly(inst)` on non-Windows and use a Windows-only wrapper that additionally unlocks `inst.AdditionalDisks` before killing processes and terminating the WSL distro.

**Suggestions**

None.

**Design Observations**

1. The Windows workaround should be isolated behind a platform-specific force-stop implementation instead of replacing Lima's stop helper everywhere. Doing that keeps Rancher Desktop's Windows process handling fix while letting upstream Lima remain the source of truth for non-Windows cleanup, which removes a class of drift bugs like the dropped disk unlocks above. `(in-scope)`

**Testing Assessment**

1. There is no coverage for forced stop, orphan recovery, or restart on a VM configured with `additionalDisks`. A BATS case on macOS `vz` would have caught the stale-lock regression immediately.
2. There are still no controller unit tests in `pkg/controllers/lima/limavm/controllers`, so platform-specific shutdown behavior is only exercised through BATS. That leaves the Windows/non-Windows fallback branches harder to validate in isolation.
3. I did not see a test that forces the Windows `signalHostagent()` path to fail and verifies the direct-kill fallback. Current Windows BATS coverage assumes `CTRL_BREAK` delivery succeeds.

**Documentation Assessment**

No user-facing documentation gap stood out from these commits. The code comments explain the Windows/WSL2 intent, but if the force-stop logic is kept custom, it should explicitly document which parts of Lima's upstream cleanup must remain mirrored.

---

### Gemini 3.1 Pro

### Executive Summary

This PR successfully adapts process signaling and lifecycle management for Windows and WSL2, fixing BATS tests and introducing a `CREATE_NEW_PROCESS_GROUP` strategy to bypass Lima's destructive `SysKill` behavior on Windows. However, it introduces critical regressions in VM shutdown logic: orphaned hostagents are now forcefully hard-killed upon controller restart (risking data corruption), and failing to gracefully signal a hostagent leaves the VM driver (WSL2/QEMU) secretly running in the background.

### Findings

#### Critical Issues

1. **Orphaned hostagents are forcefully killed without graceful shutdown** -- `pkg/controllers/lima/limavm/controllers/limavm_lifecycle.go:504-507` `(critical, regression)`

```go
// killOrphanedHostagent terminates an orphaned hostagent (one running without a
// watcher, typically after a controller restart).
func (r *LimaVMReconciler) killOrphanedHostagent(_ context.Context, inst *limatype.Instance) error {
	stopInstanceForcibly(inst)
	return nil
}
```

The PR replaces `limainstance.StopGracefully` with an immediate `stopInstanceForcibly(inst)` for orphaned hostagents. While removing Lima's `StopGracefully` is necessary to prevent a crash loop on Windows (as its `SysKill` implementation targets the entire console group), completely dropping graceful shutdown is unsafe. It causes the controller to forcefully hard-kill running VMs (both the driver and hostagent) whenever the RDD daemon restarts, denying the guest OS the opportunity to sync filesystems and cleanly shut down. This poses a high risk of user data corruption.

**Fix:** Reimplement a safe graceful shutdown within `killOrphanedHostagent` using the new `process.Interrupt(inst.HostAgentPID)` to signal the hostagent's process group. Wait for the process to exit (or poll `store.Inspect` until it reports `Stopped`), and only fall back to `stopInstanceForcibly` if the timeout expires.

2. **Failing to signal the hostagent leaves the VM driver orphaned and leaks resources** -- `pkg/controllers/lima/limavm/controllers/limavm_lifecycle.go:466-473` `(critical, regression)`

```go
	} else {
		// Signal delivery failed (e.g. os.Interrupt is not supported on Windows).
		// Kill the hostagent process directly via its process handle rather than
		// via StopForcibly (which kills by PID and cleans up files concurrently).
		logger.Info("Could not signal hostagent, killing process directly")
		r.killHostagent(limaVM.Name)
		r.waitForProcessExit(ctx, limaVM.Name)
	}
```

In `stopInstance` (and identically around line 295 in `handleRestartNeeded`), if `r.signalHostagent` fails or returns false because the watcher state is lost, the fallback branch calls `r.killHostagent(limaVM.Name)`. `killHostagent` *only* terminates the hostagent process; it does not terminate the VM driver (QEMU or WSL2 keepalive) or clean up tmp files. Because `stopInstanceForcibly(inst)` is intentionally skipped in this branch, the driver process is left orphaned and running indefinitely. The instance silently enters a `Broken` state (hostagent dead, driver alive), which bypasses the `inst.Status == limatype.StatusRunning` error check, falsely marking the CR as `Stopped` while the VM continues to consume system resources.

**Fix:** In the `else` block, after killing the hostagent process (or if the state is missing), you must call `stopInstanceForcibly(inst)` to guarantee the driver is terminated and the instance directory is fully cleaned up.

#### Important Issues

1. **Unbounded execution of `wsl.exe` can permanently hang the reconciler** -- `pkg/controllers/lima/limavm/controllers/limavm_lifecycle.go:531-533` `(important, gap)`

```go
	if inst.VMType == limatype.WSL2 {
		distroName := "lima-" + inst.Name
		// Best-effort: ignore errors (distro may already be stopped).
		_ = exec.Command("wsl.exe", "--terminate", distroName).Run()
	}
```

The `wsl.exe` command is executed synchronously without a timeout. If the WSL subsystem enters a degraded or unresponsive state (a known failure mode for WSL2), `wsl.exe` can hang indefinitely. Because this is called directly within the reconcile loop, a hang will permanently block the worker goroutine, potentially starving the controller-runtime's worker pool and halting all VM lifecycle operations.

**Fix:** Use `context.WithTimeout` and `exec.CommandContext` to enforce a strict timeout (e.g., 15-30 seconds) on the `wsl.exe` execution.

2. **`filepath.Glob` usage breaks tmp file cleanup if paths contain metacharacters** -- `pkg/controllers/lima/limavm/controllers/limavm_lifecycle.go:536-540` `(important, regression)`

```go
	for _, suffix := range filenames.TmpFileSuffixes {
		matches, _ := filepath.Glob(filepath.Join(inst.Dir, "*"+suffix))
		for _, m := range matches {
			_ = os.Remove(m)
		}
	}
```

`filepath.Glob` treats characters like `[` and `]` as pattern metacharacters. If `inst.Dir` (which derives from `RDD_LIMA_HOME` or the user's profile path) contains these characters, the pattern will fail to match silently. The `.pid` and `.sock` files will be left behind in the instance directory, preventing the VM from successfully restarting until manually deleted.

**Fix:** Mirror Lima's original `StopForcibly` logic: use `os.ReadDir(inst.Dir)` to list files and `strings.HasSuffix(f.Name(), suffix)` to match them safely regardless of the directory path's contents.

#### Suggestions

1. **Unintended PID targeting risk on Windows** -- `pkg/controllers/lima/limavm/controllers/limavm_lifecycle.go:521-527` `(suggestion, gap)`
On Windows, `os.FindProcess(pid)` succeeds even if the process has terminated, returning a handle to whatever process currently holds that PID. If the original driver or hostagent crashed and the OS recycled the PID quickly, `p.Kill()` will forcefully terminate an unrelated application. While this risk is inherent to relying on PIDs recorded in files (and existed in Lima's `osutil.SysKill`), it is worth noting. Process handles (like `state.cmd.Process`) should be preferred over PIDs when available.

### Design Observations

**Duplicate cleanup logic vs upstreaming** `(future)`
The PR hand-rolls a simplified version of Lima's `StopForcibly` logic into `stopInstanceForcibly` because Lima's `osutil.SysKill` crashes the RDD service on Windows. Rather than maintaining a parallel implementation (which currently omits Lima's disk-unlocking logic and risks diverging over time), consider contributing the `CREATE_NEW_PROCESS_GROUP` strategy upstream to Lima. If `osutil.SysKill(..., osutil.SigInt)` worked safely on Windows, RDD could rely natively on Lima's `StopGracefully` and `StopForcibly` APIs again.

### Testing Assessment

The BATS tests were successfully updated to pass on Windows by replacing raw Unix commands with `assert_process_alive` and fixing WSL path conversions. However, test coverage has the following gaps (ranked by risk):

1. **Orphaned hostagent shutdown safety:** The test "restart service and verify orphaned hostagent recovery" verifies that the old PID is dead, but it does not verify *how* it died. Because of the regression in `killOrphanedHostagent`, this test passes even though the VM is now force-killed. A test should verify that a graceful shutdown occurred (e.g., by checking if a shutdown trap script fired inside the guest OS).
2. **Driver termination verification:** Tests do not verify that the driver (QEMU/WSL) is actually terminated when the hostagent crashes or fails to stop, which would have caught the leaked WSL distro bug in the `stopInstance` fallback block.
3. **WSL timeout resilience:** There is no test simulating a hung `wsl.exe` command to ensure the controller doesn't block forever.

### Documentation Assessment

The new `stopInstanceForcibly` function lacks comments explaining why it intentionally omits the disk-unlocking steps found in Lima's original `StopForcibly`. If RDD doesn't use `AdditionalDisks`, a comment should state this assumption so future maintainers don't introduce disks and subsequently encounter locked-disk errors during forceful stops.
