# Deep Review: 20260320-234220-pr-242 | | | |---|---| | **Date** | 2026-03-20 23:42 | | **Author** | [@jandubois](https://github.com/jandubois) | | **PR** | [#242](https://github.com/rancher-sandbox/rancher-desktop-daemon/pull/242) — Msys2 ci | | **Branch** | `msys2-ci` | | **Commits** | `eb0f332` Honor RDD_LOG_LEVEL in external controllers
`9557d83` Run Windows BATS tests via MSYS2 instead of WSL2
`0936e8c` Pre-generate Lima SSH keys on Windows to avoid broken path conversion
`37bec17` Enable Lima BATS tests on Windows
`4d1b877` Fix stopInstanceForcibly to terminate WSL2 distro and clean up files
`ddb5fbf` Fix Windows process signaling for Lima hostagent lifecycle | | **Reviewers** | Claude Opus 4.6, Codex GPT 5.4, Gemini 3.1 Pro | | **Verdict** | **Merge with fixes** — one important bug in SSH key generation can permanently wedge the Lima controller on Windows | | **Wall-clock time** | 15 min 50 s | --- ## Consolidated Review ### Executive Summary This PR migrates Windows BATS tests from WSL2 to MSYS2, adds cross-platform process signaling for the Lima hostagent lifecycle, pre-generates SSH keys to work around Lima's broken cygpath conversion, and fixes forceful instance shutdown for WSL2. The Go changes consolidate three separate shutdown sequences into one (`shutdownHostagent`) and introduce a clean `process` package abstraction. The only significant defect: `ensureSSHKeys` on Windows cannot recover from a partial rename failure or from a state where the public key exists without its private key, permanently blocking controller startup. ### Critical Issues None. ### Important Issues 1. **`ensureSSHKeys` partial rename leaves unrecoverable state** — `ssh_keys_windows.go:32-62` [Claude Opus 4.6, Codex GPT 5.4] (important, regression) ```go if _, err := os.Stat(privPath); err == nil { if _, err := os.Stat(pubPath); err == nil { return nil } _ = os.Remove(privPath) } // ... if err := os.Rename(tmpPath+".pub", pubPath); err != nil { return fmt.Errorf("failed to rename public key: %w", err) } if err := os.Rename(tmpPath, privPath); err != nil { return fmt.Errorf("failed to rename private key: %w", err) } ``` Two failure modes, both unrecoverable: **(a)** If the public-key rename (line 57) succeeds but the private-key rename (line 60) fails, the function leaves `pubPath` (new public key) and `tmpPath` (stale private key) on disk. On the next call, `os.Stat(privPath)` fails (no private key at the final path), so the function proceeds to ssh-keygen. But `tmpPath` still exists, and ssh-keygen prompts for overwrite confirmation. With no stdin, it reads EOF, interprets it as "no," and aborts. Every subsequent call hits this dead end. **(b)** The recovery code at lines 32–38 handles "private key exists without public key" but not the inverse. If `pubPath` exists without `privPath`, the function falls through to ssh-keygen, generates new keys at `tmpPath`, then attempts `os.Rename(tmpPath+".pub", pubPath)`. On Windows, `os.Rename` fails when the destination already exists, permanently blocking the rename. Both failures prevent `SetupWithManager` from completing (line 452 of `limavm_controller.go`), which blocks the entire Lima controller. Fix: Clean up both files on error, and handle the inverse case symmetrically: ```go // Handle both half-present states before generation. if _, err := os.Stat(privPath); err == nil { if _, err := os.Stat(pubPath); err == nil { return nil } _ = os.Remove(privPath) } _ = os.Remove(pubPath) // remove orphaned public key if present // ... ssh-keygen ... if err := os.Rename(tmpPath+".pub", pubPath); err != nil { _ = os.Remove(tmpPath) _ = os.Remove(tmpPath + ".pub") return fmt.Errorf("failed to rename public key: %w", err) } if err := os.Rename(tmpPath, privPath); err != nil { _ = os.Remove(pubPath) _ = os.Remove(tmpPath) return fmt.Errorf("failed to rename private key: %w", err) } ``` --- ### Suggestions 1. **Unbounded wait after force-kill in `shutdownHostagent`** — `limavm_lifecycle.go:504,509` [Claude Opus 4.6] (suggestion, regression) ```go if !r.waitForProcessExit(stopCtx, name) { logger.Info("Hostagent did not exit gracefully, forcing stop") forceStop() r.waitForProcessExit(ctx, name) } ``` After `forceStop()` (SIGKILL/taskkill), `waitForProcessExit(ctx, name)` at line 504 uses the reconciler's context, which may have no deadline. If `KillTree` fails silently (zombie, permission issue), this blocks the reconciler indefinitely. The same pattern appears at line 509. Fix: Use a bounded context, e.g., `context.WithTimeout(ctx, 10*time.Second)`, consistent with how `shutdownAllHostagents` creates a fresh timeout for cleanup. 2. **Stale PID in `stopInstanceForcibly` after process reap** — `limavm_lifecycle.go:506-511`, `limavm_controller.go:508-523` [Gemini 3.1 Pro] (suggestion, regression) ```go // shutdownHostagent else branch: r.killHostagent(name) r.waitForProcessExit(ctx, name) // process reaped, PID free forceStop() // KillTree on inst.HostAgentPID ``` In the else branch, the process is killed and reaped before `forceStop()` calls `KillTree` on the PID read from disk. The PID is free for reuse. On Unix, `KillTree` sends to the process group (`-pid`), which no longer exists, so `Kill` returns `ESRCH` harmlessly. On Windows, `taskkill /PID` could theoretically target a recycled PID. The same pattern exists in `shutdownAllHostagents` (line 514, then line 523). Practical risk is low: the window is microseconds, and PID recycling requires exhausting the range. The real value of `forceStop()` in this path is WSL2 distro termination and tmp file cleanup, not the kill. Still, reordering `forceStop()` before `waitForProcessExit` in the else branch would eliminate the window entirely. 3. **Missing `break` in tmp file cleanup loop** — `limavm_lifecycle.go:612-623` [Gemini 3.1 Pro] (suggestion, enhancement) ```go for _, f := range entries { for _, suffix := range filenames.TmpFileSuffixes { if strings.HasSuffix(f.Name(), suffix) { // remove file... } } } ``` Once a file matches a suffix and is removed, the inner loop continues comparing its name against remaining suffixes. Adding `break` after the removal block avoids unnecessary iterations. 4. **`setVerbosityFromEnv` default message lacks guidance** — `pkg/external/main.go:274` [Claude Opus 4.6] (suggestion, regression) ```go fmt.Fprintf(os.Stderr, "RDD_LOG_LEVEL=%q is not supported for klog (only trace, debug); using default verbosity\n", level) ``` The message says the level is unsupported but does not explain which levels klog accepts. Since this runs before klog initialization, raw stderr is understandable, but the message could list the valid values (`trace`, `debug`, `info`) to help operators self-correct. --- ### Design Observations **Concerns** - **`stopInstanceForcibly` handles too many responsibilities** `(future)` [Claude Opus 4.6] — The function kills processes, unlocks disks, terminates WSL2 distros, and removes tmp files. Splitting it into `killProcesses` + `cleanupInstanceArtifacts` would let `shutdownAllHostagents` call only the cleanup half after its own `KillTree`, and would simplify future timeout consolidation (the existing TODO about hardcoded timeouts). [Gemini 3.1 Pro] - **PID-based orphan management is inherently racy** `(future)` [Gemini 3.1 Pro] — `killOrphanedHostagent` targets PIDs loaded from disk (`inst.HostAgentPID`, `inst.DriverPID`). On a heavily loaded system, a process that crashed before controller startup may have had its PID reassigned. A sanity check (executable name or process start time) before killing would reduce risk, though a complete solution requires OS-level mechanisms (pidfds on Linux, job objects on Windows). **Strengths** - The `shutdownHostagent` consolidation replaces three inline stop-signal-wait-kill sequences with one function and one shutdown protocol. All callers get consistent behavior. [Claude Opus 4.6, Codex GPT 5.4] - `process.SetGroup` + `process.Interrupt` + `process.KillTree` provides a clean cross-platform abstraction. On Windows, using `CREATE_NEW_PROCESS_GROUP` + `CTRL_BREAK_EVENT` avoids the console-group broadcast that was killing the RDD service. [Claude Opus 4.6, Gemini 3.1 Pro] - The `ensureSSHKeys` approach (pre-generate before Lima's broken cygpath path) is a clean workaround. The atomic-rename pattern prevents partial keys from being used by Lima. [Claude Opus 4.6, Codex GPT 5.4] - `stopInstanceForcibly` correctly replaces Lima's `StopForcibly` for Windows, preserving disk/tmp-file cleanup while adding WSL2 distro termination. [Codex GPT 5.4] --- ### Testing Assessment The BATS tests were updated systematically: `skip_on_windows` removed, cross-platform process helpers (`assert_process_alive`, `process_kill`) used throughout, and timeouts increased for Windows boot times. Untested scenarios, ranked by risk: 1. **`ensureSSHKeys` partial rename recovery** — No test covers the case where one rename succeeds and the other fails, or where the public key exists without the private key. This is the highest-risk gap: it can permanently wedge the controller. 2. **`RDD_LOG_LEVEL` in external controllers** — The CLI's log level is tested in `bats/tests/10-cli/4-log-level.bats`, but the new `setVerbosityFromEnv` path in `pkg/external/main.go` has no coverage. 3. **WSL2 distro termination timeout** — The 10s timeout in `stopInstanceForcibly` (line 598) is not exercised. If `wsl.exe --terminate` hangs, the test would hang rather than testing the timeout path. 4. **`shutdownHostagent` kill-then-cleanup path** — The else branch (lines 506–511, signal fails) fires only when no watcher exists but a hostagent is running — an unusual state with no direct test. --- ### Documentation Assessment No public API changes require external documentation. The inline comments are thorough: `stopInstanceForcibly` explains why Lima's `StopForcibly` cannot be used on Windows, and `ensureSSHKeys` explains the cygpath workaround. The CI workflow changes are self-explanatory. --- ### Commit Structure The commit sequence is well-ordered: process signaling foundation (`ddb5fbf`), then `stopInstanceForcibly` (`4d1b877`), SSH keys (`0936e8c`), BATS enablement (`37bec17`), and CI migration (`9557d83`). Each commit is self-contained. The final commit (`eb0f332`, RDD_LOG_LEVEL in external controllers) is tangentially related to the Windows/MSYS2 scope, but small enough that splitting it into its own PR would add overhead without reducing review risk. --- ### Unresolved Feedback No review comments on this PR. --- ## Agent Performance Retro ### [Claude] Claude produced the most thorough review. It correctly identified the SSH key partial-rename bug with two distinct failure modes and proposed two alternative fixes. It also caught the unbounded `waitForProcessExit` after force-kill, the redundant `KillTree` in `shutdownAllHostagents`, and the `setVerbosityFromEnv` message gap. One false positive: the MSYS2 path conversion finding (`commands.bash:36-44`) references a file not changed in this PR. Claude classified it as a "regression," but the file is outside the diff entirely. Claude traced into called functions, evaluated the shutdown lifecycle end-to-end, and evaluated test coverage gaps with specificity. High signal-to-noise ratio. ### [Codex] Codex delivered the most concise review. It identified the SSH key bug with accurate analysis of both the "public without private" and "partial rename" failure modes, matching Claude's findings. Its design observations were sharp: noting that `stopInstanceForcibly` is a good fit for the Windows-specific failure mode it addresses. Codex raised no false positives. However, it produced fewer findings overall — missing the unbounded wait, the PID reuse window, and the tmp file cleanup optimization. Its review was efficient but shallow compared to Claude and Gemini. ### [Gemini] Gemini raised the most provocative finding: a PID reuse race condition in the shutdown path. The analysis is technically correct — waiting for process reap before calling `KillTree` on the same PID creates a window for PID recycling. However, Gemini rated this CRITICAL despite the vanishingly small practical likelihood (microsecond window, PID range exhaustion required). On Unix, `KillTree` targets the process group, not the PID itself, further reducing risk. Downgraded to SUGGESTION. Gemini correctly identified `KillTree`'s behavior with non-process-group leaders as a concern, though it overstated the severity. The `break` optimization in the tmp file loop was a valid micro-finding. Gemini's strength was tracing the full lifecycle of PIDs through the shutdown sequence. Gemini missed the SSH key bug entirely. ### Summary | Metric | Claude Opus 4.6 | Codex GPT 5.4 | Gemini 3.1 Pro | |---|---|---|---| | Duration | 5:46 | 3:47 | 4:45 | | Critical | 0 | 0 | 1 (downgraded) | | Important | 2 | 1 | 1 | | Suggestion | 4 | 0 | 1 | | Design observations | 3 | 2 | 2 | | False positives | 1 | 0 | 1 (severity) | | Unique insights | 2 | 0 | 2 | Claude provided the most value overall: broadest coverage, deepest tracing, one false positive. Codex was the most precise but shallowest. Gemini brought a unique angle (PID lifecycle analysis) but miscalibrated severity and missed the most important bug. The multi-agent approach justified itself: Claude and Codex converged on the SSH key bug independently, while Gemini surfaced the PID reuse concern that the others missed. --- ## Improvement Recommendations ### Repo context updates - `[repo]` On Windows, `os.Rename` fails when the destination file already exists (unlike Unix, which atomically replaces). Review agents should check for this platform difference in any Windows-specific code that uses `os.Rename`. This would have helped Gemini catch the SSH key bug. --- ## Appendix: Original Reviews ### Claude Opus 4.6 Now I have a comprehensive understanding of the PR. Here is my review: --- ## Executive Summary This PR enables Windows BATS testing via MSYS2 (replacing WSL2), adds cross-platform process signaling for Lima hostagent lifecycle management, pre-generates SSH keys to work around Lima's broken cygpath conversion, and fixes forceful instance shutdown to properly terminate WSL2 distros and clean up tmp files. The Go changes are well-designed — `shutdownHostagent` consolidates three separate stop-and-cleanup sequences into one — but the SSH key generation has a partial-failure mode that can permanently block key regeneration. ## Findings ### Critical Issues None. ### Important Issues 1. **Partial rename in `ensureSSHKeys` can permanently block key generation** — `ssh_keys_windows.go:57-62` (important, regression) ```go if err := os.Rename(tmpPath+".pub", pubPath); err != nil { return fmt.Errorf("failed to rename public key: %w", err) } if err := os.Rename(tmpPath, privPath); err != nil { return fmt.Errorf("failed to rename private key: %w", err) } ``` If the public key rename at line 57 succeeds but the private key rename at line 60 fails, the function returns an error without cleaning up. The state left behind: `tmpPath` (stale private key), `pubPath` (renamed public key), no `tmpPath+".pub"`. On the next call, `os.Stat(privPath)` at line 32 fails (no private key at final path), so the function proceeds to ssh-keygen. But `tmpPath` still exists from the previous run, and ssh-keygen prompts for overwrite confirmation. With no stdin attached, ssh-keygen reads EOF, interprets it as "no", and fails. Every subsequent call hits the same dead end. The recovery code at lines 32-38 only handles "private key exists without public key" — it doesn't handle the inverse case. Fix: Reverse the rename order (private key first, then public key). If the private key rename succeeds but the public key rename fails, the next call enters the "private key exists without public key" path at line 36, removes the orphaned private key, and regenerates cleanly. Alternatively, add cleanup to the error paths: ```go if err := os.Rename(tmpPath+".pub", pubPath); err != nil { _ = os.Remove(tmpPath) _ = os.Remove(tmpPath + ".pub") return fmt.Errorf("failed to rename public key: %w", err) } if err := os.Rename(tmpPath, privPath); err != nil { _ = os.Remove(pubPath) // undo the successful rename _ = os.Remove(tmpPath) return fmt.Errorf("failed to rename private key: %w", err) } ``` 2. **MSYS2 path conversion heuristic may convert non-filesystem arguments** — `bats/helpers/commands.bash:36-44` (important, regression) ```bash for arg in "$@"; do if [[ "${arg}" =~ ^/([a-zA-Z]/|tmp(/|$)) ]]; then args+=("$(cygpath -w "${arg}")") elif [[ "${arg}" == /* ]] && [[ -e "${arg}" ]]; then args+=("$(cygpath -w "${arg}")") else args+=("${arg}") fi done ``` The `elif` branch at line 40 converts any `/`-prefixed argument that exists on the filesystem. This means that if an API path like `/status` coincidentally matches an existing directory `/status`, it would be silently converted to `C:\status`. The existence check is a reasonable heuristic, but it conflates filesystem paths with API paths — a class of bugs that's hard to reproduce and diagnose. Fix: Consider restricting the existence-based conversion to paths that start with known mount prefixes (e.g., `/home/`, the BATS root, the repo root), or documenting this as a known limitation. The first regex branch (`^/([a-zA-Z]/|tmp(/|$))`) is well-bounded; extending that pattern is safer than a blanket existence check. ### Suggestions 1. **`shutdownHostagent` second `waitForProcessExit` has no timeout** — `limavm_lifecycle.go:504` (suggestion, regression) ```go if !r.waitForProcessExit(stopCtx, name) { logger.Info("Hostagent did not exit gracefully, forcing stop") forceStop() r.waitForProcessExit(ctx, name) } ``` After `forceStop()` (which calls `KillTree` with SIGKILL/taskkill /F), `waitForProcessExit(ctx, name)` at line 504 uses the reconciler's context, which typically has no deadline. In practice, the process should die quickly after SIGKILL, but if `KillTree` fails silently (e.g., a zombie or permission issue), this blocks the reconciler indefinitely. The same pattern appears at line 509. Fix: Use a bounded context, e.g., `context.WithTimeout(ctx, 10*time.Second)`, consistent with how `shutdownAllHostagents` at line 520 creates a fresh timeout context for cleanup. 2. **`stopInstanceForcibly` called redundantly in `shutdownAllHostagents`** — `limavm_controller.go:512,523` (suggestion, gap) ```go _ = process.KillTree(context.Background(), state.cmd.Process.Pid) // ... stopInstanceForcibly(cleanupCtx, logger, inst) ``` `KillTree` at line 512 kills the hostagent and its descendants. Then `stopInstanceForcibly` at line 523 calls `KillTree` again on the same PID (now dead, so it fails silently). This is intentional — `stopInstanceForcibly` also handles WSL2 distro termination and tmp file cleanup — but extracting the cleanup-only steps (WSL2 terminate + tmp file removal) into a separate function would make the intent clearer and avoid the redundant kill. 3. **`setVerbosityFromEnv` stderr message isn't structured** — `pkg/external/main.go:274` (suggestion, regression) ```go fmt.Fprintf(os.Stderr, "RDD_LOG_LEVEL=%q is not supported for klog (only trace, debug); using default verbosity\n", level) ``` The error messages in `setVerbosityFromEnv` use raw `fmt.Fprintf` to stderr. Since this runs before klog is fully initialized, this is understandable, but the message also mentions `warn` and `error` levels are unsupported without suggesting what to do. Consider mentioning that klog only has numeric verbosity levels and the mapping is trace=4, debug=2. 4. **`editor_cmd` single-quote wrapping fragile for paths with quotes** — `bats/tests/33-lima-controllers/limavm-running.bats:85-92` (suggestion, regression) ```bash editor_cmd() { local script=$1 if is_msys; then echo "bash '$(cygpath -m "${script}")'" else echo "${script}" fi } ``` If `$script` (from `BATS_TEST_TMPDIR`) ever contains a single quote, the wrapping breaks. This is unlikely in practice (BATS temp dirs don't contain quotes), but `printf '%q'` would be more robust. ## Design Observations **Concerns** - **`stopInstanceForcibly` mixes process killing, disk cleanup, and distro termination** — the function does three unrelated things: kill processes, unlock disks, terminate WSL2 distros, and clean tmp files. As the TODO notes, the shutdown path has several hardcoded timeouts (10s for WSL terminate, 30s for graceful shutdown, 60s for Delete). Splitting `stopInstanceForcibly` into `killProcesses` + `cleanupInstanceArtifacts` would let `shutdownAllHostagents` call only the cleanup half after its own `KillTree`, and would make timeout consolidation easier. (future) **Strengths** - The `shutdownHostagent` consolidation (replacing three inline stop-signal-wait-kill sequences) is a clear improvement — one function, one shutdown protocol, consistent behavior across all callers. - The `process` package's cross-platform API (`Interrupt`, `KillTree`, `SetGroup`) is well-designed: the Unix and Windows implementations are genuinely different in mechanism but identical in semantics. Using `CREATE_NEW_PROCESS_GROUP` + `CTRL_BREAK_EVENT` on Windows instead of `GenerateConsoleCtrlEvent(CTRL_C)` avoids the console-group broadcast that was killing the RDD service. - The `ensureSSHKeys` approach (pre-generate before Lima's broken cygpath path) is a clean workaround with atomic rename semantics. The temporary path + rename pattern prevents partial keys from being used. - The BATS `vm_template()` / `vm_template_running()` abstraction cleanly separates template selection from test logic, and the `RDD_WSL_DISTRO` knob provides easy switching between distros. ## Testing Assessment Test coverage is strong. The BATS tests were updated systematically: `skip_on_windows` removed, cross-platform process management (`assert_process_alive`, `process_kill`) used throughout, and timeouts increased to accommodate Windows boot times. Specific untested scenarios: 1. **`ensureSSHKeys` partial rename recovery** — no unit test covers the case where one rename succeeds and the other fails. This is the highest-risk untested path in the Go changes. 2. **WSL2 distro termination timeout** — the 10s timeout in `stopInstanceForcibly` (line 598) and 60s timeout in `handleDeletion` (line 57) are not exercised in tests. If `wsl.exe --terminate` hangs, the test would hang rather than testing the timeout path. 3. **`shutdownHostagent` fallback to `killHostagent`** — the else branch at line 506-511 (signal fails, kill directly) is not directly tested. It fires when no watcher exists but a hostagent is somehow running — an unusual state. ## Documentation Assessment No public API changes require external documentation. The in-code comments are thorough — the `stopInstanceForcibly` comment explains exactly why Lima's `StopForcibly` can't be used on Windows, and the `ensureSSHKeys` comment explains the cygpath workaround. The CI workflow changes are self-explanatory. ## Commit Structure The commit sequence is well-ordered and each commit is self-contained: 1. `ddb5fbf` (process signaling) lays the foundation 2. `4d1b877` (stopInstanceForcibly) builds on it for WSL2 cleanup 3. `0936e8c` (SSH keys) addresses a separate Windows porting issue 4. `37bec17` (enable Lima BATS) enables the tests 5. `9557d83` (MSYS2 CI) switches the CI infrastructure 6. `eb0f332` (RDD_LOG_LEVEL) is independent and could be a separate PR, but it's small and related to Windows CI debugging The only minor concern: commit 6 (`eb0f332`, RDD_LOG_LEVEL in external controllers) is somewhat unrelated to the MSYS2/Windows scope of the other commits. It would review and revert independently as its own PR, but bundling it here is pragmatic given its size. --- ### Codex GPT 5.4 **Verdict**: Needs changes ### Executive Summary This PR moves Windows BATS to MSYS2, adds Windows-specific Lima lifecycle handling, pre-generates Lima SSH keys on Windows, and propagates `RDD_LOG_LEVEL` into external controllers. The lifecycle changes are mostly coherent after tracing start/stop/delete paths into Lima and the hostagent, but I found one important Windows regression in the new SSH-key bootstrap path that can prevent the Lima controller from starting at all after a partial key-generation failure. ### Findings **Critical Issues** None. **Important Issues** 1. **Partial SSH keypair states can permanently wedge Lima controller startup on Windows** — [ssh_keys_windows.go:32-60](/Users/jan/git/rancher-desktop-daemon/pkg/controllers/lima/limavm/controllers/ssh_keys_windows.go#L32) `(important, regression)` ```go if _, err := os.Stat(privPath); err == nil { if _, err := os.Stat(pubPath); err == nil { return nil } _ = os.Remove(privPath) } ... if err := os.Rename(tmpPath+".pub", pubPath); err != nil { return fmt.Errorf("failed to rename public key: %w", err) } if err := os.Rename(tmpPath, privPath); err != nil { return fmt.Errorf("failed to rename private key: %w", err) } ``` `ensureSSHKeys()` only treats the "private key exists, public key missing" case as corrupt at lines 32-38, but it does nothing when `user.pub` exists and `user` is missing. In that state, the later `os.Rename(tmpPath+".pub", pubPath)` at line 57 fails on Windows because the destination already exists, and `SetupWithManager()` aborts the entire Lima controller startup at [limavm_controller.go:452-454](/Users/jan/git/rancher-desktop-daemon/pkg/controllers/lima/limavm/controllers/limavm_controller.go#L452). The code can also create that exact broken state itself: if line 57 succeeds but line 60 fails, the new public key is left behind without the matching private key, so every later startup hits the same rename failure again. Fix: handle both half-present cases symmetrically before generation, and clean up both final and temporary files on rename failure. The safest pattern is to generate into a temp directory, then replace both final files only after both temp files exist; if either finalization step fails, remove the partially-installed destination files before returning. **Suggestions** None. ### Design Observations **Concerns** - `(future)` Windows service shutdown still is not truly graceful. The tests now explicitly accept orphan recovery after `rdd svc stop`, but the underlying reason is still the pre-existing `process.Kill(pid)` call at [pkg/service/cmd/service.go:341](/Users/jan/git/rancher-desktop-daemon/pkg/service/cmd/service.go#L341), which maps to `TerminateProcess` on Windows and bypasses the controller's shutdown runnable. If Windows service stop should eventually guarantee hostagent cleanup before process exit, that needs to be solved in the service layer rather than in Lima controller recovery logic. **Strengths** - The new `stopInstanceForcibly()` path is a good fit for the Windows-specific failure mode it is addressing: it preserves Lima's disk/tmp-file cleanup behavior while adding explicit WSL2 distro termination and avoiding Lima's console-group kill behavior. - Replacing direct `os.Process.Signal` calls with `pkg/util/process` gives the hostagent lifecycle code one cross-platform abstraction instead of scattered Unix/Windows special cases. ### Testing Assessment 1. The highest-risk gap is the new Windows SSH-key bootstrap path. There is no test covering `user.pub` without `user`, `user` without `user.pub`, or a failure after the public key is installed but before the private key is renamed. 2. `RDD_LOG_LEVEL` is tested for the service process in [bats/tests/10-cli/4-log-level.bats](/Users/jan/git/rancher-desktop-daemon/bats/tests/10-cli/4-log-level.bats), but there is no coverage for the new external-controller path in [pkg/external/main.go:260-274](/Users/jan/git/rancher-desktop-daemon/pkg/external/main.go#L260). 3. The Windows forced-stop path is only exercised indirectly through BATS. There is still no targeted test for the `taskkill`/`wsl.exe --terminate` cleanup branch in [limavm_lifecycle.go:574-620](/Users/jan/git/rancher-desktop-daemon/pkg/controllers/lima/limavm/controllers/limavm_lifecycle.go#L574). ### Documentation Assessment No user-facing documentation changes look strictly required, but the Windows-specific SSH-key workaround is subtle enough that a short code comment about the failure-recovery expectations around partial keypairs would help future maintainers. ### Commit Structure This PR is broader than the title suggests. It bundles CI migration, Lima controller lifecycle fixes, a Windows SSH-key bootstrap workaround, and external-controller logging changes into one review unit; splitting CI/workflow changes from runtime controller behavior would make both review and rollback safer. --- ### Gemini 3.1 Pro ### Executive Summary This PR successfully introduces MSYS2 CI support for Windows BATS tests, implements a clever workaround for Lima's SSH key path resolution, and replaces Lima's process termination logic to prevent `SysKill` from taking down the RDD service alongside the hostagent on Windows. The most significant finding is a critical race condition introduced by the new forceful shutdown sequences, which wait for a process to be reaped before forcefully killing its PID, risking collateral damage to unrelated applications. ### Findings #### Critical Issues 1. **Race condition in forceful shutdown killing reaped PIDs** — `pkg/controllers/lima/limavm/controllers/limavm_lifecycle.go:498-508` (critical, regression) ```go } else { logger.Info("Could not signal hostagent, killing process directly") r.killHostagent(name) r.waitForProcessExit(ctx, name) forceStop() } ``` By calling `r.waitForProcessExit`, the sequence explicitly waits for the background `state.cmd.Wait()` to return. At this exact moment, the OS has reaped the hostagent process, and its PID is free to be reassigned to any newly spawned application. Immediately after, `forceStop()` invokes `stopInstanceForcibly`, which calls `process.KillTree` on `inst.HostAgentPID` (the identical, now reaped PID). Sending `SIGKILL` to the process group of a reassigned PID can violently terminate an innocent application. The exact same bug exists in `shutdownAllHostagents` (`pkg/controllers/lima/limavm/controllers/limavm_controller.go:506-523`), where `<-state.procExited` is awaited to reap the process before extracting `inst` and running `stopInstanceForcibly`. Fix: Perform forceful process termination *before* waiting for the process to exit and be reaped. For `shutdownHostagent`: ```go } else { logger.Info("Could not signal hostagent, forcing stop directly") forceStop() // stopInstanceForcibly already kills the hostagent tree r.waitForProcessExit(ctx, name) } ``` For `shutdownAllHostagents`, ensure `stopInstanceForcibly` is invoked *before* the `<-state.procExited` synchronization point. #### Important Issues 1. **`KillTree` silently fails for non-process group leaders on Unix** — `pkg/util/process/process_unix.go:19-24` (important, regression) ```go // KillTree sends SIGKILL to the process group led by the given PID. // The process must have been started with SetGroup so it leads its own group. func KillTree(_ context.Context, pid int) error { return unix.Kill(-pid, unix.SIGKILL) } ``` If the target `pid` is not a process group leader, `unix.Kill(-pid, unix.SIGKILL)` will fail with `ESRCH` ("no such process") and do nothing. While `rdd` correctly sets `SetGroup` for the hostagent, driver processes (like QEMU) are spawned autonomously by Lima. If Lima alters its driver execution or an unrecognized PID is encountered during `killOrphanedHostagent`, `KillTree` will abort and permanently leak the target process. Lima's original `StopForcibly` avoided this by using `syscall.Kill(pid, sig)`. Fix: Add a fallback to kill the individual process if sending the signal to the process group fails with `ESRCH`: ```go func KillTree(_ context.Context, pid int) error { err := unix.Kill(-pid, unix.SIGKILL) if err == unix.ESRCH { return unix.Kill(pid, unix.SIGKILL) } return err } ``` #### Suggestions 1. **Avoid unnecessary loop iterations during cleanup** — `pkg/controllers/lima/limavm/controllers/limavm_lifecycle.go:531-543` (suggestion, enhancement) ```go for _, f := range entries { for _, suffix := range filenames.TmpFileSuffixes { if strings.HasSuffix(f.Name(), suffix) { path := filepath.Join(inst.Dir, f.Name()) if err := os.Remove(path); err != nil { logger.V(1).Info("Failed to remove tmp file", "path", path, "error", err) } else { logger.V(1).Info("Removed tmp file", "path", path) } } } } ``` Once a temporary file matches a suffix and is removed, there is no need to continue comparing its name against the remaining suffixes in `filenames.TmpFileSuffixes`. Fix: Add a `break` statement after the `if/else` block logging the removal to advance to the next file in `entries`. ### Design Observations **Concerns** - **PID-based orphan management races (in-scope)**: `killOrphanedHostagent` targets processes using PIDs loaded from disk state (`inst.HostAgentPID` and `inst.DriverPID`). On a heavily loaded system, a process that crashed prior to controller startup may have had its PID reassigned. Sending signals to untracked PIDs is fundamentally racy. While difficult to fully avoid without relying on cgroups or pidfds, adding a sanity check (such as matching the process start time or executable name before killing) would heavily mitigate the risk of killing unrelated system processes. **Strengths** - Using `CREATE_NEW_PROCESS_GROUP` on Windows to isolate `CTRL_BREAK_EVENT` signals is a textbook, robust solution to the well-known issue of console-wide signal propagation taking down the parent service. ### Testing Assessment The test coverage is solid. The BATS tests comprehensively cover the new MSYS2-specific behaviors. The execution correctly validates and checks graceful shutdown paths versus forced termination scenarios. ### Documentation Assessment The inline comments are excellent—particularly the rationale explaining the nuances of Windows interop (e.g. `wsl.exe` unregister hangs, `TerminateProcess` behavior, and MSYS2 SSH cygpath discrepancies).