Deep Review: 20260320-133532-msys2-ci
| Date | 2026-03-20 13:35 |
| Branch | msys2-ci |
| Commits |
56b59d4 Honor RDD_LOG_LEVEL in external controllersedc2e9a Run Windows BATS tests via MSYS2 instead of WSL2349a276 Pre-generate Lima SSH keys on Windows to avoid broken path conversion6175f8d Enable Lima BATS tests on Windows7b1a401 Fix stopInstanceForcibly to terminate WSL2 distro and clean up files462fcf3 Fix Windows process signaling for Lima hostagent lifecycle
|
| Reviewers | Claude Codex Gemini |
| Verdict | Merge with fixes — two confirmed regressions in orphan recovery and additional-disk unlock need resolution before merge |
| Wall-clock time | 29 min 28 s |
Consolidated Review ¶
Executive Summary ¶
This branch adds Windows support for Lima hostagent lifecycle management: cross-platform process signaling (process.Interrupt, process.KillTree), SSH key pre-generation to work around MSYS2 path conversion, custom force-stop with WSL2 distro cleanup, BATS test migration from WSL2 to MSYS2, and RDD_LOG_LEVEL propagation to external controllers.
The Windows-specific work is well-designed, but the new stopInstanceForcibly helper replaced Lima's upstream StopForcibly on all platforms. This caused two regressions: orphan recovery lost its graceful shutdown phase, and additional-disk locks are no longer released after forced stops. The RDD_LOG_LEVEL feature silently ignores unrecognized levels.
Critical Issues ¶
None.
Important Issues ¶
}
switch inst.Status {
case limatype.StatusRunning, limatype.StatusBroken:
// Orphaned hostagent from before controller restart. Kill it so the
// next reconcile can start with a watcher.
logger.Info("Found orphaned hostagent, killing it", "status", inst.Status)
if err := r.killOrphanedHostagent(ctx, inst); err != nil {
logger.Error(err, "Failed to kill orphaned hostagent")
return ctrl.Result{}, err
}
return ctrl.Result{RequeueAfter: time.Second}, nil
// killOrphanedHostagent terminates an orphaned hostagent (one running without a
// watcher, typically after a controller restart).
func (r *LimaVMReconciler) killOrphanedHostagent(ctx context.Context, inst *limatype.Instance) error {
stopInstanceForcibly(ctx, log.FromContext(ctx), inst)
return nil
}
Before this branch, killOrphanedHostagent attempted limainstance.StopGracefully with a 30-second timeout before falling back to limainstance.StopForcibly. The new code goes straight to stopInstanceForcibly, which sends SIGKILL without a graceful phase. Every controller restart with a running orphan now hard-kills the guest, risking filesystem corruption.
Verified via git diff upstream/main...HEAD: the old implementation called StopGracefully then StopForcibly; the new code calls only stopInstanceForcibly.
Fix: restore graceful-then-force for orphan recovery. On Windows, where StopGracefully cannot deliver signals to orphaned processes, skip to force-stop.
func stopInstanceForcibly(ctx context.Context, logger logr.Logger, inst *limatype.Instance) {
for _, pid := range []int{inst.DriverPID, inst.HostAgentPID} {
if pid > 0 {
if err := process.KillTree(ctx, pid); err != nil {
logger.V(1).Info("Failed to kill process tree", "pid", pid, "error", err)
}
}
}
// On WSL2, terminate the distro so store.Inspect reports StatusStopped.
if inst.VMType == limatype.WSL2 {
distroName := "lima-" + inst.Name
// Best-effort with a timeout: wsl.exe can hang if the WSL
// subsystem is degraded; don't block the caller.
wslCtx, cancel := context.WithTimeout(ctx, 10*time.Second)
defer cancel()
if err := exec.CommandContext(wslCtx, "wsl.exe", "--terminate", distroName).Run(); err != nil {
logger.V(1).Info("Failed to terminate WSL2 distro", "distro", distroName, "error", err)
}
}
Lima's upstream StopForcibly (at lima/pkg/instance/stop.go:119-129) iterates inst.AdditionalDisks, inspects each disk, and calls disk.Unlock(). The new stopInstanceForcibly omits this step entirely. After any forced stop or delete, in_use_by locks remain, blocking later reuse of those disks.
Fix: port the additional-disk unlock loop from Lima's StopForcibly into stopInstanceForcibly, after process cleanup and before returning.
// setVerbosityFromEnv sets klog verbosity from RDD_LOG_LEVEL as a default.
// Called before flag.Parse() so that an explicit -v flag on the command line
// takes precedence.
func setVerbosityFromEnv() {
level := strings.TrimSpace(os.Getenv("RDD_LOG_LEVEL"))
switch strings.ToLower(level) {
case "trace":
_ = flag.Set("v", "4")
case "debug":
_ = flag.Set("v", "2")
}
}
The RDD CLI supports six log levels; external controllers handle only trace and debug. Setting RDD_LOG_LEVEL=info or RDD_LOG_LEVEL=warn is silently ignored — operators expect quieter output and get none. klog's numeric verbosity can only increase detail, so mapping warn/error to lower output is impossible through -v alone.
Fix: add a default case that logs when a non-empty, unrecognized level is set, so operators see why their setting has no effect:
default:
if level != "" {
klog.Infof("RDD_LOG_LEVEL=%q not supported for klog (only trace, debug); using default verbosity", level)
}
@test "graceful service stop terminates hostagent" {
local ha_pid
ha_pid=$(get_hostagent_pid "${VM_NAME}")
assert [ -n "${ha_pid}" ]
assert_process_alive "${ha_pid}"
rdd svc stop
# On Windows, child process handles (wsl.exe) keep the hostagent PID visible
# in the process table long after termination. The next test ("restart service
# after graceful shutdown") verifies the hostagent is functionally dead.
if is_unix; then
try --max 15 --delay 1 --until-fail -- assert_process_alive "${ha_pid}"
fi
}
The "graceful service stop terminates hostagent" test skips its key assertion on Windows. The follow-up test succeeds regardless of whether graceful shutdown or orphan recovery cleaned up the hostagent. On Windows, process.Kill calls TerminateProcess, which bypasses Go's signal handlers — shutdownAllHostagents never runs. The test cannot distinguish the two paths.
Fix: add a Windows-specific check (e.g., query rdd ctl get limavm status) to verify the hostagent was stopped by the service, not by orphan recovery. Or document this as a known Windows limitation and remove the misleading comment in the next test.
func ensureSSHKeys() error {
configDir, err := dirnames.LimaConfigDir()
if err != nil {
return err
}
privPath := filepath.Join(configDir, filenames.UserPrivateKey)
if _, err := os.Stat(privPath); err == nil {
return nil
}
if err := os.MkdirAll(configDir, 0o700); err != nil {
return fmt.Errorf("could not create %q: %w", configDir, err)
}
ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
defer cancel()
cmd := exec.CommandContext(ctx, "ssh-keygen", "-t", "ed25519", "-q", "-N", "",
"-C", "lima", "-f", privPath)
if out, err := cmd.CombinedOutput(); err != nil {
return fmt.Errorf("failed to generate SSH key: %s: %w", out, err)
}
return nil
}
If ssh-keygen crashes or is interrupted, it may leave a partial private key. On the next run, os.Stat(privPath) succeeds, ensureSSHKeys returns nil, and Lima fails to start with a corrupt key. Recovery requires manual deletion.
Fix: generate into a temporary path and rename on success, or verify the keypair is usable before returning.
// gracefulShutdownTimeout is the time to wait for a hostagent to exit
// after sending SIGINT before falling back to SIGKILL.
gracefulShutdownTimeout = 30 * time.Second
)
This comment references Unix signal names, but the code now uses process.Interrupt (CTRL_BREAK on Windows) and process.KillTree (taskkill on Windows). The comment predates this PR but describes behavior this PR changed.
Fix: update to platform-agnostic language: "time to wait for a hostagent to exit after requesting graceful shutdown before falling back to forced termination."
Suggestions ¶
- name: "Windows: write MSYS2 wrapper script"
if: runner.os == 'Windows'
shell: cmd /c copy {0} run-in-wsl.cmd
working-directory: ${{ runner.temp }}/bin
run: |
SET MSYSTEM=UCRT64
SET CHERE_INVOKING=1
SET MSYS2_PATH_TYPE=inherit
C:\msys64\usr\bin\bash.exe --login -e %1
EXIT %ERRORLEVEL%
The wrapper invokes MSYS2, not WSL, but the file is still named run-in-wsl.cmd. Future maintainers will find this confusing.
Fix: rename to run-in-msys.cmd and update the shell: references. (The name is also used on Linux/macOS as run-in-wsl, so a broader rename to run-in-shell may be cleaner.)
# On Windows, rdd.exe can't execute MSYS2 paths directly. Use "bash <path>"
# so the editor command is resolved by bash, and convert the path for Win32.
editor_cmd() {
local script=$1
if is_msys; then
echo "bash $(cygpath -m "${script}")"
else
echo "${script}"
fi
}
If the temp directory contains spaces, the unquoted path breaks. Currently safe on CI, but fragile.
Fix: echo "bash '$(cygpath -m "${script}")'".
privPath := filepath.Join(configDir, filenames.UserPrivateKey)
if _, err := os.Stat(privPath); err == nil {
return nil
}
if err := os.MkdirAll(configDir, 0o700); err != nil {
return fmt.Errorf("could not create %q: %w", configDir, err)
If the private key exists but the public key was deleted or corrupted, the function returns nil. Lima fails later with a less obvious error.
Fix: also stat privPath + ".pub" before returning early.
for _, f := range entries {
for _, suffix := range filenames.TmpFileSuffixes {
if strings.HasSuffix(f.Name(), suffix) {
path := filepath.Join(inst.Dir, f.Name())
if err := os.Remove(path); err != nil {
logger.V(1).Info("Failed to remove tmp file", "path", path, "error", err)
}
}
}
}
Failures are logged but successes are not. When debugging stale-file issues, knowing which files were removed is valuable.
Fix: log successful removals at V(1).
# Check if a process is alive. Returns 0 if alive, non-zero otherwise.
assert_process_alive() {
local pid=$1
if is_windows; then
MSYS_NO_PATHCONV=1 tasklist.exe /FI "PID eq ${pid}" /NH 2>/dev/null | grep -q "${pid}"
else
kill -0 "${pid}" 2>/dev/null
fi
}
The grep matches the PID string anywhere in the tasklist output line. If the PID appears in an image name or window title, this produces a false positive. Unlikely for short numeric PIDs.
Fix: anchor the match: grep -qw "${pid}".
func setVerbosityFromEnv() {
level := strings.TrimSpace(os.Getenv("RDD_LOG_LEVEL"))
switch strings.ToLower(level) {
case "trace":
_ = flag.Set("v", "4")
case "debug":
_ = flag.Set("v", "2")
}
}
flag.Set errors are suppressed. Unlikely to fail for a registered flag, but a stderr warning would aid diagnosis.
Fix: log an error if flag.Set returns non-nil.
Design Observations ¶
Concerns ¶
The new helper exists because Lima's StopForcibly sends CTRL_BREAK to the entire console group on Windows, killing the RDD service. But by replacing limainstance.StopForcibly on all platforms, the change lost graceful orphan recovery and additional-disk unlock on Unix/macOS. A better split: call Lima's upstream StopGracefully/StopForcibly on Unix where they work correctly, and use the custom stopInstanceForcibly only on Windows where the upstream behavior is broken.
On Windows, process.Kill calls TerminateProcess — an immediate, non-catchable kill. Go's signal handlers never fire, so shutdownAllHostagents never runs. Hostagents survive as orphans until the next service start. On Unix, process.Kill sends SIGTERM, which triggers graceful shutdown. This asymmetry means the graceful shutdown path added by this PR is exercised only on Unix. A possible fix: change process.Kill on Windows to send CTRL_BREAK_EVENT first (now that CREATE_NEW_PROCESS_GROUP is set up), wait for graceful exit, then fall back to TerminateProcess.
Strengths ¶
- The
processpackage abstraction (SetGroup,Interrupt,KillTree) cleanly separates platform concerns from controller logic. The reconciler never mentions signals or Windows APIs. Claude - Pre-generating SSH keys during controller setup contains the MSYS2/OpenSSH path-conversion workaround in one place, avoiding scattered Windows-specific keygen logic through the start path. Claude Codex
- Using a fresh context for WSL2 cleanup during manager shutdown avoids inheriting the already-cancelled manager context — a subtle detail that prevents cleanup failures. Claude Codex
- The
forceStopclosure inshutdownHostagentthat lazily resolvesinstviastore.Inspecthandles both callers with an instance handle and those without. Claude - Extracting VM templates into
vm_template.bashwithRDD_WSL_DISTROselection eliminates duplication and makes Windows template selection configurable. Claude
Testing Assessment ¶
Test coverage is good overall. The BATS tests exercise the full lifecycle: create, start, crash recovery, orphan recovery, graceful shutdown, restart, template changes, and editor commands. The Windows adaptations (assert_process_alive, process_kill, editor_cmd, WSL2 distro cleanup in teardown) are sensible.
Untested scenarios, ranked by risk:
- Forced stop with additional disks — no test covers whether
in_use_bylocks are released after forced stop, delete, or crash recovery. This is the highest-risk gap given confirmed finding #2. - Orphan recovery graceful shutdown — no test asserts that orphan recovery on Unix attempts graceful shutdown before force-killing.
- Windows graceful shutdown path — the test cannot distinguish graceful shutdown from orphan recovery on Windows (finding #4).
- SSH key generation failure — no test covers what happens when
ssh-keygenfails midway or is missing from PATH. RDD_LOG_LEVELfor external controllers — no test verifies that external controllers honor the environment variable. Existing log-level tests cover only the RDD CLI.- WSL2 distro cleanup timeout —
stopInstanceForciblyuses a 10-second timeout forwsl.exe --terminate. No test exercises the timeout path.
Documentation Assessment ¶
No documentation gaps. The vm_template.bash header comment documents RDD_WSL_DISTRO usage. The stopInstanceForcibly godoc explains the motivation. The setVerbosityFromEnv comment explains the flag.Parse ordering.
Agent Performance Retro ¶
Claude ¶
Claude Opus 4.6 produced the most thorough review. Its six findings (3 important, 4 suggestions) were all accurate with no false positives. Claude uniquely identified the stale comment on gracefulShutdownTimeout, the Windows test coverage gap, editor_cmd quoting, assert_process_alive matching, ensureSSHKeys public key check, and the missing file-removal logging. Its design observation about Windows rdd svc stop bypassing graceful shutdown was the deepest architectural insight — no other agent caught this asymmetry.
Claude did not catch the two most impactful regressions (lost graceful orphan recovery, missing disk unlock). Its classification of setVerbosityFromEnv as "regression" rather than "gap" is debatable — the feature is new to this branch, so a missing case is closer to a gap than a regression in existing behavior.
Codex ¶
Codex GPT 5.4 identified both confirmed regressions: the lost graceful shutdown in killOrphanedHostagent and the missing additional-disk unlock. These are the two findings most likely to cause real-world problems. Codex's design observation about the shared helper widening blast radius correctly identified the root cause of both issues.
Codex's review was the shortest (3 findings) but had the highest signal-to-noise ratio. Every finding was confirmed and actionable. It did not identify any suggestions or minor gaps — its focus was narrowly on correctness regressions.
Gemini ¶
Gemini 3.1 Pro raised two findings as CRITICAL, both of which are false positives after verification:
- The
KillTreeindefinite hang: the<-state.procExitedwait has a timeout guard inshutdownAllHostagents(lines 505-514), and SIGKILL guarantees process exit. - The
KillTreefails for non-group-leader drivers: Lima spawns the driver withSetpgid: true(verified inlima/pkg/driver/qemu/qemu_driver.go:295), so the driver is a process group leader.
Gemini's important findings were better: the ensureSSHKeys TOCTOU race is valid though the TOCTOU framing overstates the risk (concurrent controllers are unlikely; the real risk is interrupted ssh-keygen). The run-in-wsl.cmd naming suggestion was unique and valid.
Summary ¶
| Metric | Claude | Codex | Gemini |
|---|---|---|---|
| Duration | 8:20 | 4:06 | 5:24 |
| Critical | 0 | 0 | 0 (2 false positives) |
| Important | 3 | 2 | 2 (1 overstated) |
| Suggestion | 4 | 1 | 2 |
| Design observations | 2 | 1 | 1 |
| False positives | 0 | 0 | 2 |
| Unique insights | 5 | 2 | 1 |
Codex provided the most critical value — it found the two regressions no other agent caught. Claude provided the broadest coverage with no false positives and the deepest architectural insight. Gemini's false positives lowered its signal quality, though its TOCTOU finding (reframed as partial-key risk) and naming suggestion added value.
Skill Improvement Recommendations ¶
- Add a pre-flight check for SSH session restrictions — The skill should verify that the environment allows running review agents before launching them. When running in an SSH session, wrapper scripts may block agent CLIs (e.g., when
SSH_AUTH_SOCKorGH_TOKENis set). Add a step after the CLI availability check that tests each agent can actually produce output, or at minimum warns whenSSH_CONNECTIONis set.