# Deep Review: 20260319-173741-msys2-ci | | | |---|---| | **Date** | 2026-03-19 17:37 | | **Branch** | `msys2-ci` | | **Commits** | `be9c281` Fix Windows process signaling for Lima hostagent lifecycle
`1f0db09` Fix stopInstanceForcibly to terminate WSL2 distro and clean up files
`bfa614d` Enable Lima BATS tests on Windows
`625737d` Pre-generate Lima SSH keys on Windows to avoid broken path conversion
`50c1e96` Run Windows BATS tests via MSYS2 instead of WSL2 | | **Reviewers** | Claude Opus 4.6, Codex GPT 5.4, Gemini 3.1 Pro | | **Verdict** | **Merge with fixes** — two regressions in `stopInstanceForcibly` need attention before merge | | **Wall-clock time** | 23 min 5 s | --- ## Consolidated Review ### Executive Summary This branch fixes a critical Windows bug where Lima's `StopForcibly` uses `GenerateConsoleCtrlEvent(CTRL_BREAK)` to kill hostagents, which targets the entire console group and kills the RDD service along with the hostagent. The fix introduces `CREATE_NEW_PROCESS_GROUP` for hostagent processes and a custom `stopInstanceForcibly` that uses `os.Process.Kill` (`TerminateProcess`) targeted at individual processes. The branch also adds WSL2 distro termination, SSH key pre-generation for Windows, and migrates BATS CI from WSL2 to MSYS2. The custom `stopInstanceForcibly` replaces upstream Lima's `StopForcibly` but drops two behaviors: child process cleanup on Windows and the safer `os.ReadDir` approach for temp file removal. The deletion path now removes the cleanup finalizer even when backend cleanup fails, a deliberate trade-off for WSL2 resilience that warrants discussion. ### Critical Issues None. ### Important Issues 1. **`stopInstanceForcibly` leaves orphaned child processes on Windows** — `limavm_lifecycle.go:539-545` [Gemini 3.1 Pro] ```go for _, pid := range []int{inst.DriverPID, inst.HostAgentPID} { if pid > 0 { if p, err := os.FindProcess(pid); err == nil { _ = p.Kill() } } } ``` `os.Process.Kill()` calls `TerminateProcess` on Windows, which terminates only the specified process. Lima's hostagent spawns `ssh.exe` for port forwarding; killing the hostagent without killing its children leaves those `ssh.exe` processes holding ports. The next hostagent startup fails with "address already in use." On WSL2, `wsl.exe --terminate` (line 553) cleans up in-distro processes but not host-side port forwarders. This regression was introduced in `be9c281`. (important, regression) Fix: Use `taskkill /F /T /PID ` on Windows to terminate the process tree, or enumerate and kill child processes via the Windows job object API. 2. **`stopInstanceForcibly` uses `filepath.Glob`, which fails on paths with bracket characters** — `limavm_lifecycle.go:557-561` [Gemini 3.1 Pro] ```go for _, suffix := range filenames.TmpFileSuffixes { matches, _ := filepath.Glob(filepath.Join(inst.Dir, "*"+suffix)) for _, m := range matches { _ = os.Remove(m) } } ``` Lima's upstream `StopForcibly` (at `stop.go:143-162`) uses `os.ReadDir` + `strings.HasSuffix`. The new code uses `filepath.Glob`, which treats `[` and `]` as glob meta-characters. If the user's home directory path contains brackets (e.g., `C:\Users\name[1]`), `filepath.Glob` returns `ErrBadPattern` (silently discarded), leaving `.pid` and `.sock` files behind. The next hostagent fails to start. Introduced in `1f0db09`. (important, regression) Fix: Match upstream's approach: ```go fi, err := os.ReadDir(inst.Dir) if err == nil { for _, f := range fi { for _, suffix := range filenames.TmpFileSuffixes { if strings.HasSuffix(f.Name(), suffix) { _ = os.Remove(filepath.Join(inst.Dir, f.Name())) } } } } ``` 3. **Deletion removes cleanup finalizer after backend delete failure** — `limavm_lifecycle.go:53-63` [Codex GPT 5.4] ```go deleteCtx, deleteCancel := context.WithTimeout(ctx, 60*time.Second) err = limainstance.Delete(deleteCtx, existingInst, true) deleteCancel() if err != nil { logger.Error(err, "Failed to delete Lima instance (proceeding with finalizer removal)") } ``` When `limainstance.Delete` fails (line 56), the reconciler logs the error and continues to remove the cleanup finalizer at line 80. Once the finalizer is gone, no retry mechanism exists; a failed `wsl.exe --unregister` strands the WSL2 distro permanently. The comment at lines 59-62 documents this as a deliberate trade-off to prevent the K8s resource from getting stuck, and manual cleanup via `wsl --unregister` is possible. However, the trade-off means a transient WSL2 glitch (timeout rather than permanent failure) causes permanent resource leakage with no automated recovery. Introduced in `1f0db09`. (important, regression) Fix: Distinguish transient from permanent failures. On timeout, requeue with backoff (perhaps with a retry counter) instead of removing the finalizer immediately. Remove the finalizer unconditionally only after N retries or on a non-retryable error (e.g., "instance not found"). 4. **`shutdownAllHostagents` hard-kill fallback skips WSL2 distro termination and temp file cleanup** — `limavm_controller.go:508-512` [Claude Opus 4.6] ```go case <-time.After(gracefulShutdownTimeout): if state.cmd != nil && state.cmd.Process != nil { _ = state.cmd.Process.Kill() } <-state.procExited ``` Every other forced-shutdown path (`stopInstance` at line 479, `handleRestartNeeded` at line 301, `handleDeletion` at line 50, `killOrphanedHostagent` at line 525) calls `stopInstanceForcibly`, which terminates the WSL2 distro and cleans up `.pid`/`.sock`/`.tmp` files. This fallback does `Process.Kill()` alone, leaving WSL2 distros running after `rdd svc stop` if a hostagent hangs. Pre-existing gap, but amplified by this PR establishing `stopInstanceForcibly` as the canonical cleanup path everywhere else. (important, gap) Fix: Pass a `context.Context` to `shutdownAllHostagents` and call `store.Inspect` + `stopInstanceForcibly` after the kill fallback fires. This is a structural change suitable for a follow-up. ### Suggestions 1. **`stopInstanceForcibly` silently discards all errors** — `limavm_lifecycle.go:538-563` [Claude Opus 4.6] ```go func stopInstanceForcibly(ctx context.Context, inst *limatype.Instance) { // every _ = p.Kill(), _ = exec.CommandContext(...).Run(), _ = os.Remove(m) ``` Every error is discarded: process kill (line 542), WSL2 terminate (line 553), and file removal (line 560). A failed `wsl.exe --terminate` or process kill produces no log entry. The function is a standalone helper with no logger. (suggestion, gap) Fix: Accept a `logr.Logger` parameter and log failed cleanup steps at V(1). Alternatively, make it a method on `LimaVMReconciler` to access the logger from context. 2. **`SetGroup` overwrites `SysProcAttr` instead of merging** — `process_windows.go:20-22` [Gemini 3.1 Pro] ```go func SetGroup(cmd *exec.Cmd) { cmd.SysProcAttr = &windows.SysProcAttr{CreationFlags: windows.CREATE_NEW_PROCESS_GROUP} } ``` Both the Windows and Unix implementations overwrite `cmd.SysProcAttr` entirely. If a future caller sets other process attributes before calling `SetGroup`, those settings are silently lost. Currently, the only call site (`startInstance` at line 423) creates a fresh `exec.Cmd`, so the overwrite is safe. (suggestion, gap) Fix: Initialize `SysProcAttr` only if nil, then merge flags: ```go if cmd.SysProcAttr == nil { cmd.SysProcAttr = &windows.SysProcAttr{} } cmd.SysProcAttr.CreationFlags |= windows.CREATE_NEW_PROCESS_GROUP ``` 3. **Duplicated shutdown pattern in `stopInstance` and `handleRestartNeeded`** — `limavm_lifecycle.go:289-316` vs `limavm_lifecycle.go:472-490` [Claude Opus 4.6] Both functions implement the same signal-wait-forcestop pattern. The two implementations differ slightly (`handleRestartNeeded` re-inspects via `store.Inspect` before `stopInstanceForcibly`; `stopInstance` reuses its `inst` parameter). As both grow, they risk diverging. (suggestion, gap) Fix: Extract a `shutdownHostagent(ctx, name, inst)` helper that encapsulates the signal-or-kill-then-cleanup pattern. 4. **`killOrphanedHostagent` dropped graceful shutdown on Unix** — `limavm_lifecycle.go:522-527` [Claude Opus 4.6] ```go func (r *LimaVMReconciler) killOrphanedHostagent(ctx context.Context, inst *limatype.Instance) error { stopInstanceForcibly(ctx, inst) return nil } ``` The previous implementation tried `StopGracefully` (SIGTERM + 30s wait) before falling back to `StopForcibly`. The new code goes straight to hard kill. On Windows, the old graceful path was broken (`os.Process.Signal(os.Interrupt)` always fails), so this is a net improvement there. On Unix, orphaned hostagents lose their chance to shut down the VM driver gracefully. For post-crash recovery, speed matters more than cleanliness, so the trade-off is reasonable. (suggestion, regression) 5. **SSH key race between concurrent controller startups** — `ssh_keys_windows.go:25-42`, `limavm_controller.go:452-453` [Codex GPT 5.4] ```go if _, err := os.Stat(privPath); err == nil { return nil } // ... cmd := exec.CommandContext(ctx, "ssh-keygen", "-t", "ed25519", ...) ``` `ensureSSHKeys` has a check-then-act race: two controllers starting simultaneously against the same Lima config dir can both observe "key missing" and race `ssh-keygen`. Lima upstream uses `lockutil.WithDirLock` to serialize this. In practice, the window is narrow and `ssh-keygen` prompts to overwrite (failing without a tty), so the second instance would return an error and prevent startup. (suggestion, regression) Fix: Wrap the `ssh-keygen` call in Lima's `lockutil.WithDirLock(configDir, ...)`. --- ## Design Observations ### Concerns 1. **`stopInstanceForcibly` forks upstream cleanup semantics instead of overriding just the kill primitive** — [Codex GPT 5.4, Claude Opus 4.6] (in-scope) The new `stopInstanceForcibly` replaces the entire `StopForcibly` function rather than overriding just the Windows kill call. This already caused the `filepath.Glob` regression and the orphaned-children issue, and will drift further as Lima evolves (e.g., the additional-disk unlock at `stop.go:119-128`, which RDD doesn't currently use but might in the future). A better structure: keep one local helper for "Windows-safe kill PID" and call upstream `StopForcibly` for everything else, or copy upstream's cleanup logic faithfully when the kill primitive must differ. ### Strengths 1. **`CREATE_NEW_PROCESS_GROUP` + targeted `GenerateConsoleCtrlEvent` is the right approach** — [Claude Opus 4.6] (in-scope) — The fix isolates the hostagent into its own process group and targets `CTRL_BREAK_EVENT` at that group specifically. Correct and clean. 2. **SSH key pre-generation is a clean Lima workaround** — [Claude Opus 4.6] (in-scope) — `ensureSSHKeys` runs at controller startup, generates the keypair with a native Windows path, and is idempotent. Avoids patching Lima while keeping the fix local to RDD. 3. **BATS helpers provide good cross-platform abstraction** — [Claude Opus 4.6] (in-scope) — `assert_process_alive` and `process_kill` correctly handle the MSYS2-to-Win32 process boundary. The `vm_template.bash` module cleanly separates platform-specific template selection. --- ## Testing Assessment The BATS tests cover the core scenarios well: crash recovery, orphaned hostagent recovery, graceful shutdown, template changes, and restart — all now running on Windows via MSYS2. Untested scenarios, ranked by risk: 1. Forceful termination on Windows when hostagent has spawned child processes (e.g., `ssh.exe` port forwarders) — validates whether ports are released 2. `stopInstanceForcibly` cleanup when `inst.Dir` contains glob meta-characters 3. `shutdownAllHostagents` hard-kill fallback on WSL2 — verifies WSL2 distro cleanup 4. Deletion timeout/failure path on WSL2 — verifies finalizer behavior 5. Concurrent first-start of multiple Windows controllers sharing one Lima config dir (SSH key race) 6. `ensureSSHKeys` failure paths (ssh-keygen unavailable, timeout) --- ## Documentation Assessment The code changes are well-documented with inline comments explaining the "why" behind each platform-specific workaround. Commit messages are descriptive. No public API changes require external documentation updates. The deletion behavior change (removing finalizer after failed backend cleanup) is significant enough to warrant a comment in the LimaVM API docs or an operator-facing note, since it changes the failure mode from "stuck resource" to "orphaned WSL2 distro." --- ## Agent Performance Retro | Metric | Claude Opus 4.6 | Codex GPT 5.4 | Gemini 3.1 Pro | |---|---|---|---| | Duration | 10:52 | 3:43 | 11:42 | | Critical | 0 | 0 | 0 | | Important | 1 | 2 | 3 | | Suggestion | 3 | 1 | 1 | | Design observations | 4 | 1 | 1 | | False positives | 0 | 1 | 1 | | Unique insights | 2 | 1 | 2 | ### Claude Opus 4.6 **Unique contributions:** Identified the `shutdownAllHostagents` gap (every other path calls `stopInstanceForcibly` but this fallback doesn't), and the duplicated shutdown pattern between `stopInstance` and `handleRestartNeeded`. Provided four design observations with nuanced analysis. **Accuracy:** No false positives. Each finding was verified with specific line references and git blame analysis. **Depth:** Excellent. Traced the full lifecycle of every shutdown path, compared all callers of `stopInstanceForcibly`, and correctly identified the `killOrphanedHostagent` change as a reasonable trade-off rather than a bug. **Signal-to-noise:** High. Every finding was actionable and well-calibrated in severity. ### Codex GPT 5.4 **Unique contributions:** Identified the deletion finalizer trade-off as a regression, and flagged the missing additional-disk unlock from upstream `StopForcibly`. Traced into upstream Lima source code at specific file:line locations. **Accuracy:** The additional-disk unlock finding is technically accurate (upstream does unlock disks), but RDD does not use additional disks — no references exist in the codebase. The finding's practical impact is therefore near zero, making it a false positive for this review scope. The deletion finalizer finding is valid and well-argued. **Depth:** Good upstream tracing. Codex read Lima's `stop.go`, `sshutil.go`, and `qemu.go` to build its arguments. However, it did not verify whether RDD actually uses additional disks before flagging the omission. **Signal-to-noise:** Moderate. Two findings, one high-value (finalizer), one low-value (additional disks). ### Gemini 3.1 Pro **Unique contributions:** Identified the `filepath.Glob` regression (upstream uses `os.ReadDir` + `strings.HasSuffix`) and the orphaned child processes on Windows (`TerminateProcess` doesn't kill children). **Accuracy:** The `SysProcAttr` overwrite finding is technically valid but overstated as IMPORTANT — the only call site creates a fresh `exec.Cmd`, so overwrite is safe today. This is a defensive coding suggestion, not a current bug. The other two findings are accurate and actionable. **Depth:** Moderate. Gemini identified the correct upstream pattern for temp file cleanup and the Windows `TerminateProcess` behavior, but did not trace into the broader lifecycle (e.g., what happens on next startup when ports are held). The design observation about Windows hostagent blocking on `CTRL_BREAK_EVENT` is interesting but unsupported by evidence. **Signal-to-noise:** Good. Three findings, two high-value, one overstated. ### Overall Assessment All three agents found distinct, valuable issues. Claude provided the broadest analysis with the best severity calibration. Codex found the most architecturally significant issue (finalizer trade-off) by tracing into upstream code. Gemini found the most immediately actionable regressions (`filepath.Glob`, orphaned children). The multi-agent approach paid off: no single agent found more than 60% of the consolidated findings. --- ## Appendix: Original Reviews ### Claude Opus 4.6 | Dimension | Verdict | |-----------|---------| | **Merge** | Approve with suggestions | #### Executive Summary This branch fixes a critical Windows bug where Lima's `StopForcibly` used `GenerateConsoleCtrlEvent(CTRL_BREAK)` to kill hostagents, which targeted the entire console group and killed the RDD service along with the hostagent. The fix introduces `CREATE_NEW_PROCESS_GROUP` for hostagent processes and a custom `stopInstanceForcibly` that uses `os.Process.Kill` (TerminateProcess) targeted at individual processes. The BATS CI is ported from WSL2 to MSYS2, with appropriate cross-platform helpers. #### Important Issues 1. **`shutdownAllHostagents` hard-kill fallback doesn't clean up WSL2 distros or temp files** — `limavm_controller.go:509-513` (important, gap) ```go case <-time.After(gracefulShutdownTimeout): if state.cmd != nil && state.cmd.Process != nil { _ = state.cmd.Process.Kill() } <-state.procExited ``` Every other forced-shutdown path in this PR (`stopInstance` at line 479, `handleRestartNeeded` at line 301, `handleDeletion` at line 50, `killOrphanedHostagent` at line 525) calls `stopInstanceForcibly`, which terminates the WSL2 distro and cleans up `.pid`/`.sock`/`.tmp` files. This path — the kill fallback of `shutdownAllHostagents` — does `Process.Kill()` alone. On WSL2, the `keepAlive` nohup sleep keeps the distro running after the hostagent dies, so `rdd svc stop` with a stuck hostagent leaves WSL2 distros lingering. This gap is pre-existing (the old code also didn't call `StopForcibly` in this fallback), but the PR establishes `stopInstanceForcibly` as the canonical cleanup path everywhere else while leaving this one unchanged. The function currently has no `context.Context` parameter, which is needed for `store.Inspect` and `stopInstanceForcibly`. The graceful path (CTRL_BREAK → hostagent shuts down VM) handles cleanup in the common case; the kill fallback only fires when the hostagent hangs for 30 seconds. The next service start cleans up via `killOrphanedHostagent`. So this is a UX issue (stale distros visible to users), not data loss. Fix: Accept a `context.Context` in `shutdownAllHostagents`, and after `Process.Kill()` + `<-state.procExited`, call `store.Inspect` → `stopInstanceForcibly` for each killed hostagent. This could be a follow-up since it's a structural change. #### Suggestions 1. **`killOrphanedHostagent` lost graceful shutdown on Unix** — `limavm_lifecycle.go:522-527` (suggestion, regression) ```go func (r *LimaVMReconciler) killOrphanedHostagent(ctx context.Context, inst *limatype.Instance) error { stopInstanceForcibly(ctx, inst) return nil } ``` The old implementation tried `StopGracefully` (SIGTERM + 30s wait) before falling back to `StopForcibly`. The new code goes straight to hard kill. On Windows, the old graceful path was broken (`os.Process.Signal(os.Interrupt)` always fails on Windows), so this is a net improvement there. On Unix, orphaned hostagents lose their chance to shut down the VM gracefully (e.g., flushing QEMU disk writes). For RDD's use case, orphaned hostagents are a post-crash recovery scenario where speed matters more than cleanliness, so the trade-off is reasonable. Worth noting for future reference if VM data integrity becomes a concern. 2. **`stopInstanceForcibly` silently discards all errors** — `limavm_lifecycle.go:538-563` (suggestion, gap) ```go func stopInstanceForcibly(ctx context.Context, inst *limatype.Instance) { for _, pid := range []int{inst.DriverPID, inst.HostAgentPID} { if pid > 0 { if p, err := os.FindProcess(pid); err == nil { _ = p.Kill() } } } ``` Every error is discarded: `p.Kill()` at line 542, `exec.CommandContext(...).Run()` at line 553, `os.Remove(m)` at line 560. For best-effort cleanup this is acceptable, but a failed WSL2 `--terminate` or failed process kill would be invisible in logs. The function is standalone (not a reconciler method), so it has no logger. Fix: Accept a `logr.Logger` parameter and log at V(1) for failed cleanup steps. This pairs well with finding #1 (adding context). 3. **`stopInstance` and `handleRestartNeeded` share duplicated shutdown logic** — `limavm_lifecycle.go:289-316` vs `limavm_lifecycle.go:472-490` (suggestion, gap) Both functions implement the same pattern: signal → wait → force-stop (or: kill → wait → force-stop). The two implementations are structurally identical but differ slightly (handleRestartNeeded re-inspects via `store.Inspect` at line 310 before calling `stopInstanceForcibly`; stopInstance reuses the `inst` parameter at line 489). This duplication increases the risk of them diverging over time. Fix: Extract a `shutdownHostagent(ctx, name, inst)` helper that encapsulates the signal-or-kill-then-cleanup pattern. Both callers would use it instead of duplicating the logic. #### Design Observations 1. **`process.Interrupt` + `CREATE_NEW_PROCESS_GROUP` is the right approach** (in-scope) The PR correctly identifies that Lima's `osutil.SysKill` on Windows uses `GenerateConsoleCtrlEvent(CTRL_BREAK_EVENT, pid)` which targets the entire console group — killing the RDD service. The fix has two parts: `CREATE_NEW_PROCESS_GROUP` in `SetGroup` (line 21 of `process_windows.go`) isolates the hostagent into its own process group, and `process.Interrupt` targets that group specifically. Go's signal package maps CTRL_BREAK to `os.Interrupt`, so the hostagent's signal handlers fire correctly. This is a clean, correct solution. 2. **`handleDeletion` continuing past delete failure is a sound trade-off** (in-scope) The change at `limavm_lifecycle.go:58-63` to log-and-continue instead of returning an error on `limainstance.Delete` failure prevents resources from getting permanently stuck when WSL2's `wsl.exe --unregister` hangs. The 60-second timeout at line 55 bounds the worst case. The trade-off (orphaned WSL2 distros requiring manual cleanup) is well-documented in the comment and appropriate for a system where WSL2 degradation is an operational reality. 3. **SSH key pre-generation is a clean workaround** (in-scope) `ensureSSHKeys` in `ssh_keys_windows.go` runs at controller startup (via `SetupWithManager`) and generates the SSH keypair with a native Windows path before Lima tries to generate it with a broken cygpath-based path. The function is idempotent (checks for existing key at line 31), bounded by a 30-second timeout (line 37), and fails fast if ssh-keygen is unavailable. This avoids patching Lima while keeping the fix local to RDD. 4. **BATS helpers provide good cross-platform abstraction** (in-scope) `assert_process_alive` and `process_kill` in `os.bash` correctly handle the MSYS2 → Win32 process boundary using `tasklist.exe` and `taskkill.exe` with `MSYS_NO_PATHCONV=1`. The `vm_template.bash` module cleanly separates platform-specific template selection. The `editor_cmd` wrapper in `limavm-running.bats` handles MSYS2 path conversion for the EDITOR variable. #### Testing Assessment The BATS tests provide good coverage of the changed behavior: 1. **Covered**: Crash recovery (kill hostagent, verify restart), orphaned hostagent recovery (kill service, verify hostagent survives, restart service), graceful shutdown, template changes, restart mechanism — all now running on Windows via MSYS2. 2. **Partially covered**: Graceful shutdown hostagent death verification is skipped on Windows (line 334 in `limavm-running.bats`), but compensated by verifying the VM starts fresh in the next test (line 344). The comment explains the reason (wsl.exe handles keep PID visible). 3. **Not covered by integration tests** (ranked by risk): - `shutdownAllHostagents` hard-kill fallback on WSL2 (stuck hostagent → WSL2 distro left running) — difficult to trigger reliably in BATS - `ensureSSHKeys` failure path (ssh-keygen unavailable or timeout) — unit-testable - `handleDeletion` continuing past `limainstance.Delete` failure — would require simulating WSL2 degradation - `signalHostagent` returning false and fallback to `killHostagent` — partially covered by crash recovery test (hostagent is killed externally, but the signal-failure path in the controller is a different code path) #### Documentation Assessment The code changes are well-documented with inline comments explaining the "why" behind each platform-specific workaround. The commit messages are descriptive. No public API changes require external documentation updates. --- ### Codex GPT 5.4 Verdict: Needs changes **Executive Summary** This series moves Windows BATS coverage from WSL to MSYS2, reworks Lima hostagent signaling on Windows, pre-generates Lima SSH keys on Windows, and changes forcible stop/delete handling for WSL2-backed instances. The Windows process work is directionally right, but I found two important regressions in the new force-stop/delete path: deletion now drops the cleanup finalizer even when backend cleanup fails, and the replacement `stopInstanceForcibly()` silently stopped unlocking Lima additional disks. **Important Issues** 1. **Deletion now discards the only cleanup retry path after a failed backend delete** — `limavm_lifecycle.go:52-63` (important, regression) ```go deleteCtx, deleteCancel := context.WithTimeout(ctx, 60*time.Second) err = limainstance.Delete(deleteCtx, existingInst, true) deleteCancel() if err != nil { logger.Error(err, "Failed to delete Lima instance (proceeding with finalizer removal)") } ``` The new branch at line 56 deliberately continues after `limainstance.Delete(...)` fails at line 56, and the reconcile then proceeds to remove the cleanup finalizer via `base.RemoveCleanupFinalizer()` in `pkg/controllers/base/finalizer.go:48-55`. That is a real behavior change: the base finalizer is documented at `pkg/controllers/base/finalizer.go:25-27` as the mechanism that ensures Lima disk/file cleanup runs before the CR is deleted. Once line 63 logs and continues, there is no owner object left to retry cleanup, so a timed-out or failed `Delete` can strand the on-disk instance and WSL distro permanently. `git blame` attributes lines 53-63 to `1f0db09`, so this regression is introduced by this series. Fix: if `limainstance.Delete()` fails at line 56, keep the cleanup finalizer and return an error or `RequeueAfter` so deletion can be retried. If you need an escape hatch for degraded WSL, gate "force-remove finalizer anyway" behind an explicit user opt-in or a separate admin cleanup flow. 2. **The custom force-stop helper dropped Lima's additional-disk unlock step** — `limavm_lifecycle.go:538-562` (important, regression) ```go func stopInstanceForcibly(ctx context.Context, inst *limatype.Instance) { for _, pid := range []int{inst.DriverPID, inst.HostAgentPID} { if pid > 0 { if p, err := os.FindProcess(pid); err == nil { _ = p.Kill() } } } // ... } ``` The replacement `stopInstanceForcibly()` at lines 538-562 is now used from deletion, restart recovery, stop fallback, and orphan cleanup, but it only kills processes and removes tmp files. Lima's upstream `StopForcibly()` explicitly unlocks every `inst.AdditionalDisks` entry at `stop.go:119-128`, and `store.Inspect()` populates `inst.AdditionalDisks` at `store/instance.go:100`. That unlock is not optional: the next QEMU start path rejects still-locked disks at `driver/qemu/qemu.go:701-716`. `git blame` shows the new helper body was introduced in `be9c281`/`1f0db09`, so this is a direct regression from replacing upstream `StopForcibly()`. Fix: keep the Windows-specific process-kill behavior, but preserve the rest of Lima's force-stop semantics. The simplest fix is to copy the additional-disk unlock loop from upstream `StopForcibly()` into this helper, or refactor so only the signal/kill primitive is overridden while upstream cleanup logic stays intact. *Consolidation note: RDD does not currently use Lima additional disks (no references in the codebase). This finding is technically accurate about the upstream API contract but has no practical impact today. Downgraded to suggestion scope in the consolidated review.* **Suggestions** 1. **The Windows SSH-key workaround dropped Lima's directory lock, so multi-instance startup can race** — `limavm_controller.go:452-453`, `ssh_keys_windows.go:25-42` (suggestion, regression) ```go if err := ensureSSHKeys(); err != nil { return fmt.Errorf("failed to generate Lima SSH keys: %w", err) } ``` ```go if _, err := os.Stat(privPath); err == nil { return nil } cmd := exec.CommandContext(ctx, "ssh-keygen", ..., "-f", privPath) if out, err := cmd.CombinedOutput(); err != nil { return fmt.Errorf("failed to generate SSH key: %s: %w", out, err) } ``` `SetupWithManager()` now calls `ensureSSHKeys()` on every controller startup at lines 452-453, but `ensureSSHKeys()` at lines 30-42 is just a `Stat` plus `ssh-keygen`. Upstream Lima's `sshutil.DefaultPubKeys()` uses `lockutil.WithDirLock(configDir, ...)` at `pkg/sshutil/sshutil.go:109-125` specifically to serialize first-time key generation. This repo explicitly supports multiple RDD instances via `RDD_INSTANCE` in `pkg/instance/instance.go:15-37`, so two Windows controllers starting against the same Lima config dir can both observe "key missing" and race one another. `git blame` points both the setup call and helper implementation to `625737d`. Fix: wrap the new Windows-native `ssh-keygen` call in Lima's existing `lockutil.WithDirLock(configDir, ...)`, mirroring upstream behavior while keeping the native-path workaround. **Design Observations** 1. (in-scope) The new `stopInstanceForcibly()` forks upstream Lima cleanup semantics instead of just swapping out the Windows kill primitive. That already caused the missed additional-disk unlock, and it will keep drifting as Lima evolves. A better structure is to keep one local helper for "Windows-safe kill PID/process-group" and reuse the rest of Lima's force-stop flow unchanged; that removes a whole class of parity bugs. **Testing Assessment** 1. No test covers force-stopping or deleting a Lima VM that uses `additionalDisks`, so the disk-unlock regression above would not be caught. 2. No test exercises the delete-timeout/delete-failure path on WSL2, so the new "log and remove finalizer anyway" behavior is unverified. 3. No test covers concurrent first-start of multiple Windows RDD instances sharing one Lima config dir, so the new SSH-key generation path is untested under the race that upstream Lima locks against. **Documentation Assessment** The user-facing behavior change around failed WSL2 deletion is significant enough that it should not live only in comments. If the intent is really to allow CR deletion while backend cleanup fails, that needs explicit operator documentation and a supported manual cleanup procedure; otherwise the code should keep the existing retry semantics. --- ### Gemini 3.1 Pro #### Executive Summary This PR successfully migrates the Lima BATS integration tests to run via MSYS2 on Windows and addresses several Windows-specific lifecycle issues, including pre-generating SSH keys to sidestep MSYS2 path translation bugs and replacing Lima's forceful shutdown logic to avoid inadvertently killing the parent service. The primary findings revolve around edge cases introduced by the new lifecycle process handling: an API design gap that blindly overwrites `SysProcAttr`, a regression in how temporary files are cleaned up during forceful termination, and a risk of leaving orphaned `ssh.exe` port forwarders running in the background on Windows. #### Critical Issues None. #### Important Issues 1. **`SetGroup` blindly overwrites `SysProcAttr` instead of modifying it** — `process_windows.go:20-22` ```go func SetGroup(cmd *exec.Cmd) { cmd.SysProcAttr = &windows.SysProcAttr{CreationFlags: windows.CREATE_NEW_PROCESS_GROUP} } ``` The new cross-platform `SetGroup` function overwrites `cmd.SysProcAttr` entirely. If the caller had already configured other process attributes (e.g., `HideWindow` on Windows, or credentials on Unix), those settings would be silently discarded. This creates a brittle API that assumes it is the sole author of `SysProcAttr` properties. Fix: Modify the struct if it already exists, or initialize it if nil. Use bitwise operations where appropriate: ```go func SetGroup(cmd *exec.Cmd) { if cmd.SysProcAttr == nil { cmd.SysProcAttr = &windows.SysProcAttr{} } cmd.SysProcAttr.CreationFlags |= windows.CREATE_NEW_PROCESS_GROUP } ``` *(Apply similarly for `process_unix.go` lines 17-19 using `cmd.SysProcAttr.Setpgid = true`)* (important, gap) *Consolidation note: The only call site creates a fresh `exec.Cmd`, so the overwrite is safe today. Downgraded to suggestion in the consolidated review.* 2. **`stopInstanceForcibly` uses `filepath.Glob` which breaks on path meta-characters** — `limavm_lifecycle.go:550-555` ```go for _, suffix := range filenames.TmpFileSuffixes { matches, _ := filepath.Glob(filepath.Join(inst.Dir, "*"+suffix)) for _, m := range matches { _ = os.Remove(m) } } ``` The original `limainstance.StopForcibly` logic explicitly used `os.ReadDir` and `strings.HasSuffix`. The new implementation switches to `filepath.Glob`. If `inst.Dir` (often located in a user's home directory) contains glob meta-characters like `[` or `]`, `filepath.Glob` will fail with `ErrBadPattern` or silently return no matches. This will leave `.pid` and `.sock` files behind, which prevents the hostagent from successfully starting on the next attempt. Fix: Revert to the safer directory traversal approach originally used by Lima: ```go fi, err := os.ReadDir(inst.Dir) if err == nil { for _, f := range fi { for _, suffix := range filenames.TmpFileSuffixes { if strings.HasSuffix(f.Name(), suffix) { _ = os.Remove(filepath.Join(inst.Dir, f.Name())) } } } } ``` (important, regression) 3. **`stopInstanceForcibly` risks leaving orphaned child processes on Windows** — `limavm_lifecycle.go:535-541` ```go for _, pid := range []int{inst.DriverPID, inst.HostAgentPID} { if pid > 0 { if p, err := os.FindProcess(pid); err == nil { _ = p.Kill() } } } ``` The PR replaces Lima's `GenerateConsoleCtrlEvent(CTRL_BREAK_EVENT)` with Go's `p.Kill()`, which invokes `TerminateProcess` under the hood on Windows. While this successfully avoids killing the parent RDD service, `TerminateProcess` does not terminate child processes. Since Lima's hostagent spawns `ssh.exe` for port forwarding on Windows, forcefully killing the hostagent leaves these `ssh.exe` processes orphaned in the background. The surviving orphans will continue to hold locks on forwarded ports, triggering "address already in use" errors and blocking subsequent VM startups. Fix: For forceful termination on Windows, consider using `taskkill /F /T /PID ` to cleanly kill the process tree. Alternatively, document this limitation and explicitly clean up known child processes if they can be tracked. (important, regression) #### Suggestions 1. **Improve PID matching precision in `assert_process_alive`** — `os.bash:103-104` ```bash MSYS_NO_PATHCONV=1 tasklist.exe /FI "PID eq ${pid}" /NH 2>/dev/null | grep -q "${pid}" ``` While `tasklist` filters by the exact PID, using `grep -q "${pid}"` could technically match the PID substring inside the process name (e.g., matching "12" against `test12.exe`). Since `tasklist` outputs `INFO: No tasks are running...` when no match is found, this is functionally safe in practice, but adding the word boundary flag (`-w`) or extended regex (`-E "\b${pid}\b"`) prevents any theoretical edge cases. (suggestion, enhancement) #### Design Observations **Graceful shutdown synchronization on Windows (in-scope)** In `limavm-running.bats` (lines 332-337), you skip waiting for the hostagent PID to die on Windows, citing child process handles keeping the PID visible in the process table. In reality, Windows `tasklist` queries do not show zombie processes; if `tasklist` shows the PID, the process has not terminated. This implies the hostagent is genuinely blocking during its graceful shutdown phase (likely waiting on its `wsl.exe` driver or `ssh.exe` children that might ignore the `CTRL_BREAK_EVENT`), eventually forcing the service to hard-kill it during shutdown. Tracing why the hostagent blocks on `CTRL_BREAK_EVENT` on Windows could eliminate the need for this test workaround and improve overall shutdown reliability.