# Deep Review: 20260321-135724-pr-242

| | |
|---|---|
| **Date** | 2026-03-21 13:57 |
| **Author** | [@jandubois](https://github.com/jandubois) |
| **PR** | [#242](https://github.com/rancher-sandbox/rancher-desktop-daemon/pull/242) — Run Lima BATS tests on Windows via MSYS2 |
| **Branch** | `msys2-ci` |
| **Commits** | `ddb5fbf` Fix Windows process signaling for Lima hostagent lifecycle<br>`2797682` Fix stopInstanceForcibly to terminate WSL2 distro and clean up files<br>`3011204` Enable Lima BATS tests on Windows<br>`71ede30` Pre-generate Lima SSH keys on Windows to avoid broken path conversion<br>`42dece9` Run Windows BATS tests via MSYS2 instead of WSL2<br>`440484b` Honor RDD_LOG_LEVEL in external controllers |
| **Reviewers** | Claude Opus 4.6, Codex GPT 5.4, Gemini 3.1 Pro |
| **Verdict** | **Merge with fixes** — two bugs in the new SSH key bootstrap need attention before merge |
| **Wall-clock time** | 10 min 46 s |

---

## Consolidated Review

### Executive Summary

This PR makes Lima VM management work on Windows by isolating hostagent process groups (`CREATE_NEW_PROCESS_GROUP` + targeted `CTRL_BREAK_EVENT`), hardening WSL2 lifecycle management (terminate before unregister to prevent `wslservice.exe` deadlocks), and migrating Windows CI from WSL2-based BATS to MSYS2. The cross-platform process abstractions are clean and well-documented. The two findings that warrant fixes before merge are both in the new SSH key bootstrap: a leftover temp file can hang the controller on startup, and a missing public key triggers unnecessary private key deletion.

---

### Critical Issues

None.

---

### Important Issues

1. **Leftover temp file hangs SSH key generation on restart** — `ssh_keys_windows.go:49-58` [Claude Opus 4.6, Gemini 3.1 Pro] (important, regression)

```go
tmpPath := privPath + ".tmp"
ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
defer cancel()
cmd := exec.CommandContext(ctx, "ssh-keygen", "-t", "ed25519", "-q", "-N", "",
    "-C", "lima", "-f", tmpPath)
if out, err := cmd.CombinedOutput(); err != nil {
    _ = os.Remove(tmpPath)
    _ = os.Remove(tmpPath + ".pub")
    return fmt.Errorf("failed to generate SSH key: %s: %w", out, err)
}
```

If the controller crashes during `ssh-keygen` execution, the `.tmp` file persists. On next start, `ssh-keygen` detects the existing file and prompts `Overwrite (y/n)?`. Running without a TTY, it hangs until the 30-second timeout kills it. The function returns an error, `SetupWithManager` fails at line 452, and the lima-controller cannot start. Every subsequent restart hits the same hang until someone manually deletes the `.tmp` file.

Claude rated this Suggestion; Gemini rated it Critical. The trigger requires a crash during the ~1-second keygen window on first run (before keys exist). The impact is indefinite: the controller blocks on every startup until manual intervention. Important balances the narrow trigger against the persistent impact.

Fix: Clean up stale temp files before running `ssh-keygen`, and use `defer` for the post-failure cleanup.

```diff
 tmpPath := privPath + ".tmp"
+_ = os.Remove(tmpPath)
+_ = os.Remove(tmpPath + ".pub")
+defer os.Remove(tmpPath)
+defer os.Remove(tmpPath + ".pub")
 ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
 defer cancel()
 cmd := exec.CommandContext(ctx, "ssh-keygen", "-t", "ed25519", "-q", "-N", "",
     "-C", "lima", "-f", tmpPath)
 if out, err := cmd.CombinedOutput(); err != nil {
-    _ = os.Remove(tmpPath)
-    _ = os.Remove(tmpPath + ".pub")
     return fmt.Errorf("failed to generate SSH key: %s: %w", out, err)
 }
```

2. **Missing public key causes unnecessary private key deletion** — `ssh_keys_windows.go:32-38` [Codex GPT 5.4] (important, regression)

```go
if _, err := os.Stat(privPath); err == nil {
    if _, err := os.Stat(pubPath); err == nil {
        return nil
    }
    // Private key exists without public key — corrupt keypair.
    // Remove and regenerate.
    _ = os.Remove(privPath)
}
```

When only the public key is missing (external deletion, filesystem error, antivirus quarantine), the code deletes the valid private key and regenerates both. The guest VM's `authorized_keys` still trusts the old public key, so the new keypair breaks SSH access to existing VMs. A missing public key is recoverable: `ssh-keygen -y -f <privPath>` derives the public key from the private key.

`ensureSSHKeys` runs during `SetupWithManager` at line 452, so this executes on every controller startup. In practice, the trigger (pub key missing while priv key remains) is unusual, but the impact (VM unreachable over SSH) is severe.

Fix: Regenerate the public key from the existing private key instead of deleting both.

```diff
 if _, err := os.Stat(privPath); err == nil {
     if _, err := os.Stat(pubPath); err == nil {
         return nil
     }
-    // Private key exists without public key — corrupt keypair.
-    // Remove and regenerate.
-    _ = os.Remove(privPath)
+    // Derive the public key from the existing private key.
+    tmpPub := pubPath + ".tmp"
+    _ = os.Remove(tmpPub)
+    out, err := exec.Command("ssh-keygen", "-y", "-f", privPath).Output()
+    if err != nil {
+        return fmt.Errorf("failed to derive public key: %w", err)
+    }
+    if err := os.WriteFile(tmpPub, out, 0o644); err != nil {
+        return fmt.Errorf("failed to write public key: %w", err)
+    }
+    return os.Rename(tmpPub, pubPath)
 }
```

3. **Sequential hostagent shutdown multiplies timeout** — `limavm_controller.go:503-531` [Gemini 3.1 Pro] (important, gap)

```go
// TODO: Wait on all hostagents in parallel instead of sequentially; with the
// current loop, the total wait is N × gracefulShutdownTimeout in the worst case.
for name, state := range states {
    graceful := true
    select {
    case <-state.procExited:
    case <-time.After(gracefulShutdownTimeout):
```

The code already documents this with a TODO at line 503. With N hung hostagents, the worst case is N x 30 seconds. Windows SCM and systemd enforce shutdown timeouts (typically 30-90 seconds), so with two or more struggling instances, the OS kills the service before it finishes iterating. Trailing VMs miss the `stopInstanceForcibly` sweep, leaving unmanaged WSL2 distros running.

Fix: Wait on all hostagents in parallel with a shared timeout, as the TODO suggests.

4. **`rdd svc stop` orphans hostagents on Windows** — `process_windows.go:48`, `limavm-running.bats:342-345` [Codex GPT 5.4] (important, gap)

```bash
# On Unix, shutdownAllHostagents ran during graceful shutdown, so the
# hostagent is already dead. On Windows, TerminateProcess bypasses Go's
# signal handlers, so the hostagent survives as an orphan and is cleaned
# up by orphan recovery on the next start.
```

On Unix, `process.Kill` sends SIGTERM (line 32 of `process_unix.go`), which triggers Go's signal handler and runs `waitForShutdown` → `shutdownAllHostagents`. On Windows, `process.Kill` calls `TerminateProcess` (line 48 of `process_windows.go`), which kills the service immediately without running any Go shutdown code. Hostagents, now in their own process groups, survive as orphans.

The author documented this as known behavior in the BATS test at line 342-345, with recovery via orphan detection on next startup. This is a gap in the new Windows support, not a regression in existing behavior — `TerminateProcess` always killed only the target process, and before this PR no hostagents ran on Windows.

Fix: Make `rdd svc stop` on Windows try `process.Interrupt` first (which sends `CTRL_BREAK_EVENT`, triggering the graceful shutdown path), then fall back to `TerminateProcess` only if the service does not exit within a timeout.

---

### Suggestions

1. **Double force-stop in delete path risks stale PID reuse** — `limavm_lifecycle.go:52-57` [Claude Opus 4.6, Codex GPT 5.4] (suggestion, regression)

```go
stopInstanceForcibly(ctx, logger, existingInst)
logger.Info("Deleting Lima instance", "instance", limaVM.Name)
deleteCtx, deleteCancel := context.WithTimeout(ctx, 60*time.Second)
err = limainstance.Delete(deleteCtx, existingInst, true)
```

`handleDeletion` calls `stopInstanceForcibly` at line 52, which kills processes by PID via `KillTree`. Then `limainstance.Delete` at line 57 calls Lima's `StopForcibly` (at `lima/v2@v2.0.3/pkg/instance/delete.go:24`), which sends a second forced stop using the same stale PIDs cached in `existingInst`. If a PID was recycled between the two kills, `taskkill /F /T /PID` on Windows terminates an unrelated process tree.

The same stale-PID concern applies to `shutdownAllHostagents` at line 516-523: `KillTree` kills the process at line 512, then `stopInstanceForcibly` at line 523 calls `KillTree` again on the same PIDs from `store.Inspect`.

In a single-user desktop context, the risk is low. The PIDs from `store.Inspect` belong to recently killed processes, and PID recycling typically selects much higher values.

Fix: Clear the cached PIDs in `existingInst` after the custom `stopInstanceForcibly` call so Lima's `StopForcibly` skips the duplicate kill. Or extract the WSL2/cleanup portion of `stopInstanceForcibly` into a separate function so callers that already killed the processes can call just the cleanup.

2. **`setVerbosityFromEnv` warns but continues for unsupported levels** — `pkg/external/main.go:274` [Claude Opus 4.6] (suggestion, regression)

```go
default:
    fmt.Fprintf(os.Stderr, "RDD_LOG_LEVEL=%q is not supported for klog; valid values are trace, debug, info\n", level)
}
```

When `RDD_LOG_LEVEL=warn` or `RDD_LOG_LEVEL=error` (valid for the CLI's logrus), the external controller prints a warning to stderr and continues with default verbosity. The warning looks like an error but the controller functions normally. Consider mapping `warn`/`error` to the klog default (v=0) silently or at debug level, since they represent a reasonable "less verbose" intent.

3. **Killing `wslservice.exe` globally in CI test** — `limavm-running.bats:523-525` [Claude Opus 4.6] (suggestion, gap)

```bash
if is_windows; then
    MSYS_NO_PATHCONV=1 taskkill.exe /F /IM wslservice.exe || true
fi
```

The cleanup test kills `wslservice.exe` by image name, which terminates all WSL operations system-wide. On a dedicated CI runner this is acceptable. The comments at lines 518-522 explain the rationale well. Note that this pattern must not be adopted in production code or in tests that share a runner.

---

### Design Observations

**Concerns**

1. **`stopInstanceForcibly` bundles four unrelated responsibilities** (in-scope) [Claude Opus 4.6]

The function kills process trees, unlocks disks, terminates WSL2 distros, and cleans up temp files. In `shutdownAllHostagents`, only the WSL2/cleanup portion is needed (the process was already killed by `KillTree` above). In `handleDeletion`, the caller needs all four, but Lima's `Delete` duplicates the process killing. Splitting into `killInstanceProcesses` + `cleanupInstanceResources` would let each caller compose the pieces it needs and eliminate the redundant kills. The split is small — two functions instead of one, same total code — and aligns with how the callers actually use the function.

2. **`process.Kill` semantics diverge between platforms** (in-scope) [Codex GPT 5.4]

On Unix, `Kill` sends SIGTERM (graceful). On Windows, `Kill` calls `TerminateProcess` (forceful, bypasses all handlers). Code that calls `process.Kill` expecting graceful shutdown works on Unix but orphans child processes on Windows. A dedicated `StopService` function that tries `Interrupt` first on Windows and falls back to `Kill` would keep the cleanup invariant in one place.

**Strengths**

- The cross-platform process signaling is cleanly factored. `CREATE_NEW_PROCESS_GROUP` + `CTRL_BREAK_EVENT` precisely targets the hostagent's group without affecting the parent service — the correct approach for Windows. [Claude Opus 4.6, Gemini 3.1 Pro]
- WSL2 deadlock prevention (always terminate before unregister) is defensive and well-documented. The comments at `limavm_lifecycle.go:47-51` and `limavm_lifecycle.go:601-610` explain the non-obvious rationale. [Claude Opus 4.6, Gemini 3.1 Pro]
- SSH key pre-generation avoids patching Lima's cygpath-based path conversion. Generating the key upfront with a native Windows path causes Lima to find the existing key and skip its broken keygen path. [Claude Opus 4.6]
- The BATS helper functions (`assert_process_alive`, `process_kill`, `editor_cmd`, `vm_template`) provide clean abstractions over platform differences without leaking MSYS2/WSL details into test logic. [Claude Opus 4.6]
- Bounded timeouts around all WSL2 operations prevent a degraded WSL subsystem from hanging the reconciler indefinitely. [Codex GPT 5.4, Gemini 3.1 Pro]

---

### Testing Assessment

The BATS tests exercise all major code paths (start, stop, crash recovery, orphan recovery, graceful shutdown, template changes, cleanup) and this PR makes them cross-platform. Untested scenarios, ranked by risk:

1. **Stale temp file blocking `ensureSSHKeys`** — No test covers the crash-between-keygen-and-rename scenario. A unit test that pre-creates the `.tmp` file and verifies `ensureSSHKeys` recovers would prevent regressions on the fix.
2. **Missing public key recovery** — No test covers the private-key-exists-but-public-key-missing path. A unit test should verify that the existing private key is preserved and the public key is regenerated.
3. **Windows `rdd svc stop` hostagent cleanup** — No end-to-end test verifies that `rdd svc stop` leaves no orphaned hostagents or WSL2 distros on Windows. The current suite accepts the orphan and relies on recovery.
4. **`setVerbosityFromEnv` level mapping** — The function is small but the mapping (trace=4, debug=2) could regress silently.

---

### Documentation Assessment

The inline comments are thorough, especially for the non-obvious Windows process signaling semantics and WSL2 deadlock prevention. No user-facing documentation changes are needed. If the hostagent orphaning on Windows `svc stop` is intentional long-term, consider documenting it in the CLI help text.

---

### Commit Structure

The six commits are well-structured: each addresses a single concern in logical dependency order (process signaling foundations, WSL2 lifecycle fixes, test enablement, SSH key workaround, CI migration, log level propagation). Each commit stands alone and its message accurately describes the change. No fixup commits.

---

## Agent Performance Retro

### [Claude] Claude Opus 4.6

Claude produced the most thorough review with good cross-reference tracing (e.g., tracing from `stopInstanceForcibly` into `shutdownAllHostagents` and the `handleDeletion` path). It identified the SSH key temp file issue but underrated it as Suggestion — the indefinite controller hang justifies Important. It uniquely identified the `setVerbosityFromEnv` warning behavior, the `wslservice.exe` global kill concern, and the design observation about splitting `stopInstanceForcibly`. Its design observations and strengths sections were the most detailed.

- Unique contributions: `setVerbosityFromEnv` warning, `wslservice.exe` global kill, detailed `stopInstanceForcibly` decomposition recommendation
- Accuracy: No false positives. Under-rated the SSH temp file issue.
- Depth: Good — traced into `shutdownAllHostagents`, `handleDeletion`, and Lima's `Delete`.
- Signal-to-noise: High. Every finding was actionable.

### [Codex] Codex GPT 5.4

Codex found the missing-public-key-deletes-private-key bug that no other agent caught — the most impactful unique finding. It also identified the `rdd svc stop` orphaning gap with a concrete fix suggestion. Its analysis of `process.SetGroup` + `process.Kill` interaction showed good cross-file tracing. However, it misclassified the orphaning as a regression (it is a pre-existing gap in `process.Kill` semantics — `TerminateProcess` always killed only the target process, regardless of process groups). It also overstated the double-stop PID reuse risk.

- Unique contributions: Missing public key bug, `rdd svc stop` orphaning analysis, `process.Kill` semantic divergence concern
- Accuracy: One misclassification (orphaning as regression instead of gap). Otherwise solid.
- Depth: Good — traced into `pkg/service/cmd/service.go` (not a changed file) and Lima's `delete.go`.
- Signal-to-noise: High. Three findings, all substantive.

### [Gemini] Gemini 3.1 Pro

Gemini correctly identified the SSH key temp file issue and rated it Critical — the highest severity of the three agents. It also uniquely caught the sequential shutdown timeout concern with a complete parallel-wait fix. Its analysis was concise and its fix suggestions were well-crafted. However, it had the fewest total findings (only two), and its Critical rating was slightly high given the narrow trigger window (a crash during a ~1-second keygen operation).

- Unique contributions: Sequential shutdown timeout concern with parallel-wait fix
- Accuracy: No false positives. Critical rating arguably too high for the SSH temp file issue.
- Depth: Moderate — focused on the most-changed files without tracing as deeply into callers and cross-file interactions.
- Signal-to-noise: Very high. Two findings, both substantive, zero noise.

### Summary

| Metric | Claude Opus 4.6 | Codex GPT 5.4 | Gemini 3.1 Pro |
|---|---|---|---|
| Duration | 7:29 | 4:40 | 4:12 |
| Critical | 0 | 0 | 1 |
| Important | 2 | 2 | 1 |
| Suggestion | 3 | 1 | 0 |
| Design observations | 4 | 2 | 0 |
| False positives | 0 | 0 | 0 |
| Unique insights | 3 | 2 | 1 |

All three agents produced zero false positives — a strong showing. Codex delivered the highest-impact unique find (the missing-public-key bug). Claude provided the broadest coverage and best design analysis. Gemini was the most concise and efficient, finishing fastest with the highest signal-to-noise ratio. The multi-agent approach proved its value: the two findings that warrant pre-merge fixes were each caught by different agent subsets (temp file: Claude + Gemini; missing pub key: Codex only).

---

## Appendix: Original Reviews

### Claude Opus 4.6

Now I have a thorough understanding of the changes. Let me compile the review.

---

## Executive Summary

This PR makes Lima VM management work correctly on Windows by fixing process signaling (avoiding console-wide CTRL_BREAK that killed the RDD service), hardening WSL2 lifecycle management (terminate-before-unregister to prevent wslservice.exe deadlocks), and migrating the Windows CI from WSL2-based BATS to MSYS2-based BATS. The code is well-structured with clear separation of cross-platform concerns. The key findings are around edge-case robustness in the new `stopInstanceForcibly` and SSH key generation.

---

## Findings

### Critical Issues

None.

### Important Issues

1. **`stopInstanceForcibly` uses stale PIDs from `store.Inspect` without validation** — `limavm_lifecycle.go:583-588` (important, gap)

```go
func stopInstanceForcibly(ctx context.Context, logger logr.Logger, inst *limatype.Instance) {
	for _, pid := range []int{inst.DriverPID, inst.HostAgentPID} {
		if pid > 0 {
			if err := process.KillTree(ctx, pid); err != nil {
				logger.V(1).Info("Failed to kill process tree", "pid", pid, "error", err)
			}
		}
	}
```

The PIDs come from `store.Inspect`, which reads the PID file on disk. If the hostagent exited abnormally (SIGKILL) without cleaning up its PID file, the PID could have been recycled by the OS. On Unix, `KillTree` at line 38 of `process_unix.go` sends `SIGKILL` to the process *group* (`-pid`), which is safer (the group ID is unique to processes started with `SetGroup`). On Windows, `KillTree` at line 62 of `process_windows.go` runs `taskkill /F /T /PID` — if the PID was recycled, this kills an unrelated process tree.

This is partially mitigated because:
- Graceful shutdown is attempted first (in `shutdownHostagent` and `killOrphanedHostagent`)
- `stopInstanceForcibly` is only called as a last resort after graceful shutdown fails
- Windows PID recycling typically targets much higher PID values than recently exited processes

But `handleDeletion` at line 52 calls `stopInstanceForcibly` directly without attempting graceful shutdown first (by design — the comment at lines 47-51 explains why).

Fix: Consider checking whether the PID still belongs to a hostagent process before killing (e.g., via `tasklist /FI "PID eq ..."` and checking the image name, or opening the process and checking the executable path). This may be overkill for single-user use, but it's worth noting.

2. **`shutdownAllHostagents` still calls `stopInstanceForcibly` redundantly on already-killed trees** — `limavm_controller.go:510-524` (important, gap)

```go
		case <-time.After(gracefulShutdownTimeout):
			graceful = false
			if state.cmd != nil && state.cmd.Process != nil {
				_ = process.KillTree(context.Background(), state.cmd.Process.Pid)
			}
			<-state.procExited
		}
		if !graceful {
			logger := ctrl.Log.WithName("shutdownAllHostagents")
			cleanupCtx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
			inst, err := store.Inspect(cleanupCtx, name)
			if err == nil && inst != nil {
				stopInstanceForcibly(cleanupCtx, logger, inst)
			}
			cancel()
		}
```

After the graceful timeout at line 509, `KillTree` at line 512 kills the process tree, then `stopInstanceForcibly` at line 523 calls `KillTree` again on the same PIDs (read from `store.Inspect`). The duplicate kill is harmless but the real purpose is the WSL2 distro termination and tmp file cleanup in `stopInstanceForcibly` (lines 601-632). The code would be clearer if the intent were explicit.

Fix: Extract the WSL2/cleanup portion of `stopInstanceForcibly` into a separate function (e.g., `cleanupInstanceResources`) and call it here instead. This avoids the redundant KillTree and makes the intent obvious. Alternatively, a comment explaining that the KillTree at line 512 handles process termination while `stopInstanceForcibly` handles WSL2 and file cleanup.

### Suggestions

1. **`ensureSSHKeys` does not clean up leftover temp files before ssh-keygen** — `ssh_keys_windows.go:49-58` (suggestion, gap)

```go
	tmpPath := privPath + ".tmp"
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()
	cmd := exec.CommandContext(ctx, "ssh-keygen", "-t", "ed25519", "-q", "-N", "",
		"-C", "lima", "-f", tmpPath)
	if out, err := cmd.CombinedOutput(); err != nil {
		_ = os.Remove(tmpPath)
		_ = os.Remove(tmpPath + ".pub")
		return fmt.Errorf("failed to generate SSH key: %s: %w", out, err)
	}
```

If the process crashes between ssh-keygen completing (line 54) and the rename (line 60), `tmpPath` persists. On the next invocation, `ssh-keygen -f tmpPath` finds the file already exists and may prompt for overwrite confirmation (even with `-q`, which only suppresses the randomart display). The 30-second timeout would eventually kill it, but this is a silent hang on startup.

Fix: Add pre-cleanup before keygen:
```diff
 	tmpPath := privPath + ".tmp"
+	_ = os.Remove(tmpPath)
+	_ = os.Remove(tmpPath + ".pub")
 	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
```

2. **`setVerbosityFromEnv` silently continues after printing an unsupported level warning** — `pkg/external/main.go:274` (suggestion, regression)

```go
	default:
		fmt.Fprintf(os.Stderr, "RDD_LOG_LEVEL=%q is not supported for klog; valid values are trace, debug, info\n", level)
	}
```

The warning at line 274 prints to stderr but the controller continues with default verbosity. When `RDD_LOG_LEVEL=warn` or `RDD_LOG_LEVEL=error` is set (valid for the CLI's logrus), the external controller silently ignores it. This is probably intentional but could surprise operators. Consider mapping `warn`/`error` to the klog default (v=0) with a debug-level log explaining why, rather than printing a warning that looks like an error.

3. **`limavm-running.bats` cleanup test kills wslservice.exe globally** — `limavm-running.bats:525` (suggestion, gap)

```bash
    if ! timeout 90 rdd ctl wait --for=delete "limavm/${VM_NAME}" --namespace "${NAMESPACE}" --timeout=60s; then
        if is_windows; then
            MSYS_NO_PATHCONV=1 taskkill.exe /F /IM wslservice.exe || true
        fi
        false
    fi
```

Killing `wslservice.exe` by image name at line 525 terminates all WSL operations system-wide, not just the test's distro. In CI this is acceptable (dedicated runner), but this recovery mechanism should not be adopted in production code. The comment at lines 518-522 explains the rationale well but consider adding a note that this is CI-only and should not be replicated in the controller.

---

## Design Observations

### Concerns

1. **`stopInstanceForcibly` combines four distinct responsibilities** (in-scope)

The function kills process trees, unlocks disks, terminates WSL2 distros, and cleans up tmp files. It is called from three different contexts: `handleDeletion`, `shutdownHostagent`, and `killOrphanedHostagent`. In `shutdownAllHostagents`, only the WSL2/cleanup portion is needed (the process is already killed by `KillTree` a few lines above). Splitting into `killInstanceProcesses` + `cleanupInstanceResources` would let callers compose the pieces they need, and would eliminate the redundant kill in `shutdownAllHostagents`. The split is small (two functions instead of one, with the same total code) and aligns with how the callers actually use it.

### Strengths

- **Cross-platform process signaling is cleanly factored.** The `process` package provides `SetGroup`, `Interrupt`, `Kill`, and `KillTree` with correct platform-specific semantics. `CREATE_NEW_PROCESS_GROUP` + `CTRL_BREAK_EVENT` on Windows precisely targets the hostagent's group without affecting the parent service — this is the correct approach and avoids the console-wide `GenerateConsoleCtrlEvent(0)` that Lima's `SysKill` uses.

- **WSL2 deadlock prevention is defensive and well-documented.** Always terminating the distro before unregistering (even for "stopped" distros) prevents a kernel-level deadlock that is unfixable once triggered. The comments at `limavm_lifecycle.go:47-51` and `limavm_lifecycle.go:601-610` explain the non-obvious rationale clearly.

- **SSH key pre-generation is an elegant workaround.** Rather than patching Lima's cygpath-based path conversion, generating the key upfront with a native Windows path causes Lima to find it and skip its broken codepath entirely (`ssh_keys_windows.go:19-24`). The atomic temp-file-then-rename pattern handles interruption correctly.

- **The BATS helper functions (`assert_process_alive`, `process_kill`, `editor_cmd`) provide clean abstractions** over the platform differences without leaking MSYS2/WSL details into the test logic.

---

## Testing Assessment

Test coverage for the new functionality is good — the existing BATS tests exercise all major code paths (start, stop, crash recovery, orphan recovery, graceful shutdown, template changes, cleanup) and the changes make them cross-platform. Specific untested scenarios:

1. **No unit tests for `stopInstanceForcibly`** — The function has complex conditional logic (WSL2 distro termination, tmp file cleanup using `os.ReadDir` + suffix matching). A unit test with a mock instance directory could verify the cleanup logic, especially the edge case where `filepath.Glob` was replaced by `os.ReadDir` for Windows bracket-path safety (line 613-614).
2. **No test for `ensureSSHKeys` failure recovery** — The crash-between-keygen-and-rename scenario and the corrupt-keypair (private without public) path are untested.
3. **No test for `setVerbosityFromEnv`** — The function is small but the mapping (trace→4, debug→2) could regress silently.

---

## Commit Structure

The six commits are well-structured, each addressing a single concern in logical dependency order: process signaling foundations → WSL2 lifecycle fixes → test enablement → SSH key workaround → CI migration → log level propagation. Each commit stands alone and its message accurately describes the change. No fixup commits are present.

---

### Codex GPT 5.4

**Verdict**

Needs changes before merge.

### Executive Summary

This PR fixes the Windows Lima hostagent signaling model, hardens WSL2 teardown, and moves Windows BATS execution to MSYS2. The main regressions I found are that `rdd svc stop` / `rdd svc delete` now leave Lima hostagents orphaned on Windows, and the new SSH-key bootstrap can rotate Lima's private key on startup when only the public key is missing, breaking access to existing VMs.

### Findings

**Critical Issues**

None.

**Important Issues**

1. **`rdd svc stop` now leaves Lima hostagents orphaned on Windows** — [`pkg/service/cmd/service.go:340-343`](/Users/jan/git/rancher-desktop-daemon/pkg/service/cmd/service.go#L340) `(important, regression)`

```go
pid := PID()
if err := process.Kill(pid); err != nil {
	return fmt.Errorf("failed to stop %q control plane with pid %d: %w", instance.Name(), pid, err)
}
```

`startInstance()` now puts every hostagent into its own process group with `process.SetGroup(haCmd)` at `pkg/controllers/lima/limavm/controllers/limavm_lifecycle.go:393-395`, and `process.SetGroup()` on Windows now applies `CREATE_NEW_PROCESS_GROUP` at `pkg/util/process/process_windows.go:22-26`. That fixes VM stop/restart signaling, but `rdd svc stop` still hard-kills only the service with `process.Kill(pid)` at line 341, and `process.Kill()` still maps to `TerminateProcess` on Windows at `pkg/util/process/process_windows.go:37-53`. Because the service is terminated instead of interrupted, the manager shutdown path at `pkg/controllers/lima/limavm/controllers/limavm_controller.go:475-478` never runs, so `shutdownAllHostagents()` is skipped and the Lima hostagent survives as an orphan. The new BATS comment at `bats/tests/33-lima-controllers/limavm-running.bats:342-345` explicitly documents this orphan-on-stop behavior, so this is a confirmed regression rather than a theoretical edge case.

Fix: on Windows, make `StopWithWait()` try a graceful signal first, so the manager can execute `waitForShutdown()` and clean up hostagents, then fall back to `process.Kill()` only if the service does not exit in time.

```diff
pid := PID()
-if err := process.Kill(pid); err != nil {
-	return fmt.Errorf("failed to stop %q control plane with pid %d: %w", instance.Name(), pid, err)
+if runtime.GOOS == "windows" {
+	if err := process.Interrupt(pid); err == nil {
+		// wait path below handles normal shutdown
+	} else if err := process.Kill(pid); err != nil {
+		return fmt.Errorf("failed to stop %q control plane with pid %d: %w", instance.Name(), pid, err)
+	}
+} else if err := process.Kill(pid); err != nil {
+	return fmt.Errorf("failed to stop %q control plane with pid %d: %w", instance.Name(), pid, err)
 }
```

2. **Missing `user.pub` deletes the existing private key and breaks existing Lima VMs** — [`pkg/controllers/lima/limavm/controllers/ssh_keys_windows.go:32-38`](/Users/jan/git/rancher-desktop-daemon/pkg/controllers/lima/limavm/controllers/ssh_keys_windows.go#L32) `(important, regression)`

```go
if _, err := os.Stat(privPath); err == nil {
	if _, err := os.Stat(pubPath); err == nil {
		return nil
	}
	// Private key exists without public key — corrupt keypair.
	// Remove and regenerate.
	_ = os.Remove(privPath)
}
```

When only `user.pub` is missing, `ensureSSHKeys()` deletes the still-valid private key at line 38, and this helper now runs unconditionally during controller startup at `pkg/controllers/lima/limavm/controllers/limavm_controller.go:452-454`. That rotates Lima's SSH identity on every controller restart for this recoverable case. Existing guests trust the old public key, so deleting the private key makes already-created Lima VMs unreachable over SSH. A missing public key is not proof that the private key is corrupt; it can be regenerated from the private key.

Fix: preserve `privPath` and regenerate only `pubPath`, for example by running `ssh-keygen -y -f <privPath>` into a temp file and then renaming that temp file into place.

```diff
 if _, err := os.Stat(privPath); err == nil {
 	if _, err := os.Stat(pubPath); err == nil {
 		return nil
 	}
-	// Private key exists without public key — corrupt keypair.
-	// Remove and regenerate.
-	_ = os.Remove(privPath)
+	// Public key is recoverable from the existing private key.
+	return regeneratePublicKey(privPath, pubPath)
 }
```

**Suggestions**

1. **Avoid double-stopping the same stale PIDs during delete** — [`pkg/controllers/lima/limavm/controllers/limavm_lifecycle.go:52-57`](/Users/jan/git/rancher-desktop-daemon/pkg/controllers/lima/limavm/controllers/limavm_lifecycle.go#L52) `(suggestion, regression)`

```go
stopInstanceForcibly(ctx, logger, existingInst)
logger.Info("Deleting Lima instance", "instance", limaVM.Name)
deleteCtx, deleteCancel := context.WithTimeout(ctx, 60*time.Second)
err = limainstance.Delete(deleteCtx, existingInst, true)
```

`handleDeletion()` now pre-kills by PID at line 52 and then immediately calls `limainstance.Delete()` at line 57. Lima's `Delete()` still calls `StopForcibly(inst)` unconditionally in `/Users/jan/go/pkg/mod/github.com/lima-vm/lima/v2@v2.0.3/pkg/instance/delete.go:24`, so deletion now sends a second forced stop to the stale `DriverPID` / `HostAgentPID` cached in `existingInst`. With fast PID reuse, especially on Windows, that second kill can target an unrelated process.

Fix: after the custom pre-stop, either call the unregister/remove steps directly instead of `limainstance.Delete()`, or clear the cached PIDs before calling into Lima so the second stop is skipped.

### Design Observations

**Concerns**

- `process.SetGroup()` and `process.Kill()` now encode two different Windows lifecycle policies, but `rdd svc stop` still uses the old "hard-kill the service" path while Lima hostagents use the new "isolated process group" path. A dedicated service-stop API that always attempts graceful shutdown first would keep the cleanup invariant in one place instead of relying on unrelated process helpers. `(in-scope)`

**Strengths**

- Replacing console-wide `CTRL_BREAK_EVENT` behavior with targeted process-group interrupts is the right fix for the original "stopping a VM kills the service" bug.
- Adding bounded timeouts around WSL2 delete paths is a good operational safeguard; it prevents a wedged `wsl.exe` call from hanging the reconciler indefinitely.

### Testing Assessment

1. Windows needs an end-to-end test that `rdd svc stop` and `rdd svc delete` leave no Lima hostagent or WSL2 distro behind. The current BATS suite now explicitly accepts an orphan and only verifies recovery on the next start.
2. Windows needs coverage for the "private key exists, public key missing" startup path to ensure the controller regenerates only `user.pub` and preserves SSH access to an existing VM.
3. The delete path should be exercised with the new pre-stop logic, ideally proving that the second Lima `Delete()` call cannot reuse stale PIDs after the custom kill step.

### Documentation Assessment

If the current Windows service-stop behavior is intentional, the CLI/help text should say that `rdd svc stop` is not a clean LimaVM shutdown on Windows and relies on orphan recovery on the next startup. Otherwise no user-facing docs are needed once that regression is fixed.

---

### Gemini 3.1 Pro

### Executive Summary

This PR successfully isolates the RDD service from Lima hostagent process signals on Windows, fixing a major lifecycle issue where terminating a VM could kill the entire service. It also dramatically improves cleanup resiliency by ensuring WSL2 distros and stale lockfiles are handled robustly even in degraded states. The most significant finding is a critical regression in the new Windows SSH key generation routine that can cause the controller to hang indefinitely on startup if interrupted.

### Findings

#### Critical Issues

1. **Interrupted SSH key generation causes controller to hang on startup** — `pkg/controllers/lima/limavm/controllers/ssh_keys_windows.go:49-60` (critical, regression)

```go
       // Generate into a temporary path and rename on success, so an
       // interrupted ssh-keygen does not leave a partial key that blocks
       // future attempts.
       tmpPath := privPath + ".tmp"
       ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
       defer cancel()
       cmd := exec.CommandContext(ctx, "ssh-keygen", "-t", "ed25519", "-q", "-N", "",
               "-C", "lima", "-f", tmpPath)
```

If the controller stops or crashes during `ssh-keygen` execution, the `tmpPath` file is left behind. On subsequent starts, `ssh-keygen` detects that `tmpPath` already exists and waits for a TTY response to `Overwrite (y/n)?`. Because it is executing in the background, it hangs silently until the 30-second context timeout expires, permanently blocking the controller from starting up until the user manually deletes the `.tmp` file.

**Fix**: Explicitly clean up `tmpPath` *before* generating the key. Using `defer` to handle all cleanup simplifies the downstream logic significantly:

```diff
--- a/pkg/controllers/lima/limavm/controllers/ssh_keys_windows.go
+++ b/pkg/controllers/lima/limavm/controllers/ssh_keys_windows.go
@@ -46,21 +46,18 @@ func ensureSSHKeys() error {
 	// interrupted ssh-keygen does not leave a partial key that blocks
 	// future attempts.
 	tmpPath := privPath + ".tmp"
+	_ = os.Remove(tmpPath)
+	_ = os.Remove(tmpPath + ".pub")
+	defer os.Remove(tmpPath)
+	defer os.Remove(tmpPath + ".pub")
+
 	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
 	defer cancel()
 	cmd := exec.CommandContext(ctx, "ssh-keygen", "-t", "ed25519", "-q", "-N", "",
 		"-C", "lima", "-f", tmpPath)
 	if out, err := cmd.CombinedOutput(); err != nil {
-		// Clean up any partial files left by ssh-keygen.
-		_ = os.Remove(tmpPath)
-		_ = os.Remove(tmpPath + ".pub")
 		return fmt.Errorf("failed to generate SSH key: %s: %w", out, err)
 	}
 	if err := os.Rename(tmpPath+".pub", pubPath); err != nil {
-		_ = os.Remove(tmpPath)
-		_ = os.Remove(tmpPath + ".pub")
 		return fmt.Errorf("failed to rename public key: %w", err)
 	}
 	if err := os.Rename(tmpPath, privPath); err != nil {
 		_ = os.Remove(pubPath)
-		_ = os.Remove(tmpPath)
 		return fmt.Errorf("failed to rename private key: %w", err)
 	}
 	return nil
```

#### Important Issues

2. **Sequential wait during daemon shutdown multiplies timeout risk** — `pkg/controllers/lima/limavm/controllers/limavm_controller.go:503-531` (important, gap)

```go
       // TODO: Wait on all hostagents in parallel instead of sequentially; with the
       // current loop, the total wait is N × gracefulShutdownTimeout in the worst case.
       for name, state := range states {
               graceful := true
               select {
               case <-state.procExited:
               case <-time.After(gracefulShutdownTimeout):
```

System service managers (like systemd or Windows SCM) enforce a strict timeout (e.g., 30s to 90s) during shutdown. If `rdd svc stop` triggers graceful shutdown for 3 struggling instances, waiting sequentially blocks the daemon for `3 × 30s = 90s`. This practically guarantees that the OS will send `SIGKILL` to the RDD process before it can iterate through the list, meaning trailing VMs miss the `stopInstanceForcibly` sweep and leave unmanaged WSL distros running.

**Fix**: Address the TODO by waiting for the processes in parallel using a `sync.WaitGroup` and a single shared timeout:

```go
	// Send graceful shutdown signal to all hostagents in parallel.
	for _, state := range states {
		if state.cmd != nil && state.cmd.Process != nil {
			_ = process.Interrupt(state.cmd.Process.Pid)
		}
	}

	logger := ctrl.Log.WithName("shutdownAllHostagents")
	timeoutCtx, cancel := context.WithTimeout(context.Background(), gracefulShutdownTimeout)
	defer cancel()

	var wg sync.WaitGroup
	for name, state := range states {
		wg.Add(1)
		go func(n string, s *instanceState) {
			defer wg.Done()
			graceful := true
			select {
			case <-s.procExited:
			case <-timeoutCtx.Done():
				graceful = false
				if s.cmd != nil && s.cmd.Process != nil {
					_ = process.KillTree(context.Background(), s.cmd.Process.Pid)
				}
				<-s.procExited
			}
			if !graceful {
				cleanupCtx, cleanupCancel := context.WithTimeout(context.Background(), 30*time.Second)
				defer cleanupCancel()
				if inst, err := store.Inspect(cleanupCtx, n); err == nil && inst != nil {
					stopInstanceForcibly(cleanupCtx, logger, inst)
				}
			}
			s.cancel()
			r.instanceStatesMu.Lock()
			delete(r.instanceStates, n)
			r.instanceStatesMu.Unlock()
		}(name, state)
	}
	wg.Wait()
```

#### Suggestions

None.

### Design Observations

**Strengths**
- The integration of `CREATE_NEW_PROCESS_GROUP` on Windows to scope `CTRL_BREAK_EVENT` explicitly solves a major platform pain-point transparently and natively.
- The fallback structure in `shutdownHostagent` and `killOrphanedHostagent` is exceptionally resilient. If a tracked PID no longer exists or the signal fails, safely falling back to `KillTree` combined with explicit WSL termination and manual `os.Remove` of trailing sockets gracefully circumvents Lima's PID/socket rot issues.

### Testing Assessment

The test coverage is solid. Moving away from execution within WSL2 to using native MSYS2 avoids path-conversion edge cases, and explicitly invoking `wslservice.exe` recovery handles CI instability well. The tests now verify the controller's behavioral bounds instead of struggling against environmental quirks.

### Documentation Assessment

The inline comments are outstanding. Complex process operations (like why `wsl --terminate` is necessary before `--unregister` or why `os.ReadDir` is chosen over `filepath.Glob` due to brackets) are thoroughly justified.

### Commit Structure

Clean, granular, and properly sequenced. Switching the BATS driver logic and addressing individual fixes per commit facilitates easy regression tracking.
