# Deep Review: 20260320-133532-msys2-ci

| | |
|---|---|
| **Date** | 2026-03-20 13:35 |
| **Branch** | `msys2-ci` |
| **Commits** | `56b59d4` Honor RDD_LOG_LEVEL in external controllers<br>`edc2e9a` Run Windows BATS tests via MSYS2 instead of WSL2<br>`349a276` Pre-generate Lima SSH keys on Windows to avoid broken path conversion<br>`6175f8d` Enable Lima BATS tests on Windows<br>`7b1a401` Fix stopInstanceForcibly to terminate WSL2 distro and clean up files<br>`462fcf3` Fix Windows process signaling for Lima hostagent lifecycle |
| **Reviewers** | Claude Opus 4.6, Codex GPT 5.4, Gemini 3.1 Pro |
| **Verdict** | **Merge with fixes** — two confirmed regressions in orphan recovery and additional-disk unlock need resolution before merge |
| **Wall-clock time** | 29 min 28 s |

---

## Consolidated Review

### Executive Summary

This branch adds Windows support for Lima hostagent lifecycle management: cross-platform process signaling (`process.Interrupt`, `process.KillTree`), SSH key pre-generation to work around MSYS2 path conversion, custom force-stop with WSL2 distro cleanup, BATS test migration from WSL2 to MSYS2, and `RDD_LOG_LEVEL` propagation to external controllers.

The Windows-specific work is well-designed, but the new `stopInstanceForcibly` helper replaced Lima's upstream `StopForcibly` on all platforms. This caused two regressions: orphan recovery lost its graceful shutdown phase, and additional-disk locks are no longer released after forced stops. The `RDD_LOG_LEVEL` feature silently ignores unrecognized levels.

### Critical Issues

None.

### Important Issues

1. **Orphan recovery lost graceful shutdown phase** — `limavm_lifecycle.go:253-257`, `limavm_lifecycle.go:516-517` [Codex GPT 5.4]

```go
func (r *LimaVMReconciler) killOrphanedHostagent(ctx context.Context, inst *limatype.Instance) error {
    stopInstanceForcibly(ctx, log.FromContext(ctx), inst)
    return nil
}
```

Before this branch, `killOrphanedHostagent` attempted `limainstance.StopGracefully` with a 30-second timeout before falling back to `limainstance.StopForcibly`. The new code goes straight to `stopInstanceForcibly`, which sends SIGKILL without a graceful phase. Every controller restart with a running orphan now hard-kills the guest, risking filesystem corruption.

Verified via `git diff upstream/main...HEAD`: the old implementation at lines 516-517 called `StopGracefully` then `StopForcibly`; the new code calls only `stopInstanceForcibly`.

Fix: restore graceful-then-force for orphan recovery. On Windows, where `StopGracefully` cannot deliver signals to orphaned processes, skip to force-stop.

2. **`stopInstanceForcibly` dropped additional-disk unlock** — `limavm_lifecycle.go:532-569` [Codex GPT 5.4]

```go
func stopInstanceForcibly(ctx context.Context, logger logr.Logger, inst *limatype.Instance) {
    for _, pid := range []int{inst.DriverPID, inst.HostAgentPID} {
        if pid > 0 {
            if err := process.KillTree(ctx, pid); err != nil {
                logger.V(1).Info("Failed to kill process tree", "pid", pid, "error", err)
            }
        }
    }
    // ... tmp file cleanup, WSL2 distro cleanup
    // No additional-disk unlock
}
```

Lima's upstream `StopForcibly` (at `lima/pkg/instance/stop.go:119-129`) iterates `inst.AdditionalDisks`, inspects each disk, and calls `disk.Unlock()`. The new `stopInstanceForcibly` omits this step entirely. After any forced stop or delete, `in_use_by` locks remain, blocking later reuse of those disks.

Fix: port the additional-disk unlock loop from Lima's `StopForcibly` into `stopInstanceForcibly`, after process cleanup and before returning.

3. **`setVerbosityFromEnv` silently ignores unrecognized levels** — `pkg/external/main.go:260-268` [Claude Opus 4.6, Codex GPT 5.4]

```go
func setVerbosityFromEnv() {
    level := strings.TrimSpace(os.Getenv("RDD_LOG_LEVEL"))
    switch strings.ToLower(level) {
    case "trace":
        _ = flag.Set("v", "4")
    case "debug":
        _ = flag.Set("v", "2")
    }
}
```

The RDD CLI supports six log levels; external controllers handle only `trace` and `debug`. Setting `RDD_LOG_LEVEL=info` or `RDD_LOG_LEVEL=warn` is silently ignored — operators expect quieter output and get none. klog's numeric verbosity can only increase detail, so mapping `warn`/`error` to lower output is impossible through `-v` alone.

Fix: add a `default` case that logs when a non-empty, unrecognized level is set, so operators see why their setting has no effect:
```go
default:
    if level != "" {
        klog.Infof("RDD_LOG_LEVEL=%q not supported for klog (only trace, debug); using default verbosity", level)
    }
```

4. **Graceful shutdown test skips core verification on Windows** — `limavm-running.bats:331-336` [Claude Opus 4.6]

```bash
# On Windows, child process handles (wsl.exe) keep the hostagent PID visible
# in the process table long after termination.
if is_unix; then
    try --max 15 --delay 1 --until-fail -- assert_process_alive "${ha_pid}"
fi
```

The "graceful service stop terminates hostagent" test skips its key assertion on Windows. The follow-up test succeeds regardless of whether graceful shutdown or orphan recovery cleaned up the hostagent. On Windows, `process.Kill` calls `TerminateProcess`, which bypasses Go's signal handlers — `shutdownAllHostagents` never runs. The test cannot distinguish the two paths.

Fix: add a Windows-specific check (e.g., query `rdd ctl get limavm` status) to verify the hostagent was stopped by the service, not by orphan recovery. Or document this as a known Windows limitation and remove the misleading comment in the next test.

5. **Partial key file from failed `ssh-keygen` blocks future attempts** — `ssh_keys_windows.go:31-43` [Gemini 3.1 Pro]

```go
if _, err := os.Stat(privPath); err == nil {
    return nil
}
// ...
cmd := exec.CommandContext(ctx, "ssh-keygen", "-t", "ed25519", "-q", "-N", "",
    "-C", "lima", "-f", privPath)
```

If `ssh-keygen` crashes or is interrupted, it may leave a partial private key. On the next run, `os.Stat(privPath)` succeeds, `ensureSSHKeys` returns nil, and Lima fails to start with a corrupt key. Recovery requires manual deletion.

Fix: generate into a temporary path and rename on success, or verify the keypair is usable before returning.

6. **Stale comment on `gracefulShutdownTimeout`** — `limavm_controller.go:84-85` [Claude Opus 4.6]

```go
// gracefulShutdownTimeout is the time to wait for a hostagent to exit
// after sending SIGINT before falling back to SIGKILL.
gracefulShutdownTimeout = 30 * time.Second
```

This comment references Unix signal names, but the code now uses `process.Interrupt` (CTRL_BREAK on Windows) and `process.KillTree` (taskkill on Windows). The comment predates this PR but describes behavior this PR changed.

Fix: update to platform-agnostic language: "time to wait for a hostagent to exit after requesting graceful shutdown before falling back to forced termination."

### Suggestions

1. **`run-in-wsl.cmd` name is misleading after MSYS2 migration** — `.github/workflows/bats.yaml:124-132` [Gemini 3.1 Pro]

```yaml
- name: "Windows: write MSYS2 wrapper script"
  if: runner.os == 'Windows'
  shell: cmd /c copy {0} run-in-wsl.cmd
```

The wrapper invokes MSYS2, not WSL, but the file is still named `run-in-wsl.cmd`. Future maintainers will find this confusing.

Fix: rename to `run-in-msys.cmd` and update the `shell:` references. (The name is also used on Linux/macOS as `run-in-wsl`, so a broader rename to `run-in-shell` may be cleaner.)

2. **`editor_cmd` does not quote the converted path** — `limavm-running.bats:88` [Claude Opus 4.6]

```bash
editor_cmd() {
    local script=$1
    if is_msys; then
        echo "bash $(cygpath -m "${script}")"
    else
        echo "${script}"
    fi
}
```

If the temp directory contains spaces, the unquoted path breaks. Currently safe on CI, but fragile.

Fix: `echo "bash '$(cygpath -m "${script}")'"`.

3. **`ensureSSHKeys` checks only private key existence** — `ssh_keys_windows.go:31-33` [Claude Opus 4.6]

```go
if _, err := os.Stat(privPath); err == nil {
    return nil
}
```

If the public key is missing or corrupted while the private key exists, the function returns nil. Lima fails later with a less obvious error.

Fix: also stat `privPath + ".pub"` before returning early.

4. **`stopInstanceForcibly` does not log successful file removals** — `limavm_lifecycle.go:559-566` [Claude Opus 4.6]

```go
if err := os.Remove(path); err != nil {
    logger.V(1).Info("Failed to remove tmp file", "path", path, "error", err)
}
```

Failures are logged but successes are not. When debugging stale-file issues, knowing which files were removed is valuable.

Fix: log successful removals at V(1).

5. **`assert_process_alive` could false-positive on Windows** — `os.bash:103` [Claude Opus 4.6]

```bash
MSYS_NO_PATHCONV=1 tasklist.exe /FI "PID eq ${pid}" /NH 2>/dev/null | grep -q "${pid}"
```

The grep matches the PID string anywhere in the tasklist output line. If the PID appears in an image name or window title, this produces a false positive. Unlikely for short numeric PIDs.

Fix: anchor the match: `grep -qw "${pid}"`.

6. **Ignored `flag.Set` errors in `setVerbosityFromEnv`** — `pkg/external/main.go:264-267` [Gemini 3.1 Pro]

```go
case "trace":
    _ = flag.Set("v", "4")
```

`flag.Set` errors are suppressed. Unlikely to fail for a registered flag, but a stderr warning would aid diagnosis.

Fix: log an error if `flag.Set` returns non-nil.

---

## Design Observations

### Concerns

1. **`stopInstanceForcibly` shared cross-platform widens blast radius** `(in-scope)` [Codex GPT 5.4, Claude Opus 4.6]

The new helper exists because Lima's `StopForcibly` sends `CTRL_BREAK` to the entire console group on Windows, killing the RDD service. But by replacing `limainstance.StopForcibly` on all platforms, the change lost graceful orphan recovery and additional-disk unlock on Unix/macOS. A better split: call Lima's upstream `StopGracefully`/`StopForcibly` on Unix where they work correctly, and use the custom `stopInstanceForcibly` only on Windows where the upstream behavior is broken.

2. **Windows `rdd svc stop` bypasses graceful hostagent shutdown** `(future)` [Claude Opus 4.6]

On Windows, `process.Kill` calls `TerminateProcess` — an immediate, non-catchable kill. Go's signal handlers never fire, so `shutdownAllHostagents` never runs. Hostagents survive as orphans until the next service start. On Unix, `process.Kill` sends SIGTERM, which triggers graceful shutdown. This asymmetry means the graceful shutdown path added by this PR is exercised only on Unix. A possible fix: change `process.Kill` on Windows to send `CTRL_BREAK_EVENT` first (now that `CREATE_NEW_PROCESS_GROUP` is set up), wait for graceful exit, then fall back to `TerminateProcess`.

### Strengths

- The `process` package abstraction (`SetGroup`, `Interrupt`, `KillTree`) cleanly separates platform concerns from controller logic. The reconciler never mentions signals or Windows APIs. [Claude Opus 4.6]
- Pre-generating SSH keys during controller setup contains the MSYS2/OpenSSH path-conversion workaround in one place, avoiding scattered Windows-specific keygen logic through the start path. [Claude Opus 4.6, Codex GPT 5.4]
- Using a fresh context for WSL2 cleanup during manager shutdown avoids inheriting the already-cancelled manager context — a subtle detail that prevents cleanup failures. [Claude Opus 4.6, Codex GPT 5.4]
- The `forceStop` closure in `shutdownHostagent` that lazily resolves `inst` via `store.Inspect` handles both callers with an instance handle and those without. [Claude Opus 4.6]
- Extracting VM templates into `vm_template.bash` with `RDD_WSL_DISTRO` selection eliminates duplication and makes Windows template selection configurable. [Claude Opus 4.6]

---

## Testing Assessment

Test coverage is good overall. The BATS tests exercise the full lifecycle: create, start, crash recovery, orphan recovery, graceful shutdown, restart, template changes, and editor commands. The Windows adaptations (`assert_process_alive`, `process_kill`, `editor_cmd`, WSL2 distro cleanup in teardown) are sensible.

Untested scenarios, ranked by risk:

1. **Forced stop with additional disks** — no test covers whether `in_use_by` locks are released after forced stop, delete, or crash recovery. This is the highest-risk gap given confirmed finding #2.
2. **Orphan recovery graceful shutdown** — no test asserts that orphan recovery on Unix attempts graceful shutdown before force-killing.
3. **Windows graceful shutdown path** — the test cannot distinguish graceful shutdown from orphan recovery on Windows (finding #4).
4. **SSH key generation failure** — no test covers what happens when `ssh-keygen` fails midway or is missing from PATH.
5. **`RDD_LOG_LEVEL` for external controllers** — no test verifies that external controllers honor the environment variable. Existing log-level tests cover only the RDD CLI.
6. **WSL2 distro cleanup timeout** — `stopInstanceForcibly` uses a 10-second timeout for `wsl.exe --terminate`. No test exercises the timeout path.

---

## Documentation Assessment

No documentation gaps. The `vm_template.bash` header comment documents `RDD_WSL_DISTRO` usage. The `stopInstanceForcibly` godoc explains the motivation. The `setVerbosityFromEnv` comment explains the flag.Parse ordering.

---

## Agent Performance Retro

### [Claude]

Claude Opus 4.6 produced the most thorough review. Its six findings (3 important, 4 suggestions) were all accurate with no false positives. Claude uniquely identified the stale comment on `gracefulShutdownTimeout`, the Windows test coverage gap, `editor_cmd` quoting, `assert_process_alive` matching, `ensureSSHKeys` public key check, and the missing file-removal logging. Its design observation about Windows `rdd svc stop` bypassing graceful shutdown was the deepest architectural insight — no other agent caught this asymmetry.

Claude did not catch the two most impactful regressions (lost graceful orphan recovery, missing disk unlock). Its classification of `setVerbosityFromEnv` as "regression" rather than "gap" is debatable — the feature is new to this branch, so a missing case is closer to a gap than a regression in existing behavior.

### [Codex]

Codex GPT 5.4 identified both confirmed regressions: the lost graceful shutdown in `killOrphanedHostagent` and the missing additional-disk unlock. These are the two findings most likely to cause real-world problems. Codex's design observation about the shared helper widening blast radius correctly identified the root cause of both issues.

Codex's review was the shortest (3 findings) but had the highest signal-to-noise ratio. Every finding was confirmed and actionable. It did not identify any suggestions or minor gaps — its focus was narrowly on correctness regressions.

### [Gemini]

Gemini 3.1 Pro raised two findings as CRITICAL, both of which are false positives after verification:
- The `KillTree` indefinite hang: the `<-state.procExited` wait has a timeout guard in `shutdownAllHostagents` (lines 505-514), and SIGKILL guarantees process exit.
- The `KillTree` fails for non-group-leader drivers: Lima spawns the driver with `Setpgid: true` (verified in `lima/pkg/driver/qemu/qemu_driver.go:295`), so the driver is a process group leader.

Gemini's important findings were better: the `ensureSSHKeys` TOCTOU race is valid though the TOCTOU framing overstates the risk (concurrent controllers are unlikely; the real risk is interrupted ssh-keygen). The `run-in-wsl.cmd` naming suggestion was unique and valid.

### Summary

| Metric | [Claude] | [Codex] | [Gemini] |
|---|---|---|---|
| Duration | 8:20 | 4:06 | 5:24 |
| Critical | 0 | 0 | 0 (2 false positives) |
| Important | 3 | 2 | 2 (1 overstated) |
| Suggestion | 4 | 1 | 2 |
| Design observations | 2 | 1 | 1 |
| False positives | 0 | 0 | 2 |
| Unique insights | 5 | 2 | 1 |

Codex provided the most critical value — it found the two regressions no other agent caught. Claude provided the broadest coverage with no false positives and the deepest architectural insight. Gemini's false positives lowered its signal quality, though its TOCTOU finding (reframed as partial-key risk) and naming suggestion added value.

---

## Skill Improvement Recommendations

1. **Add a pre-flight check for SSH session restrictions** — The skill should verify that the environment allows running review agents before launching them. When running in an SSH session, wrapper scripts may block agent CLIs (e.g., when `SSH_AUTH_SOCK` or `GH_TOKEN` is set). Add a step after the CLI availability check that tests each agent can actually produce output, or at minimum warns when `SSH_CONNECTION` is set.

---

## Appendix: Original Reviews

### Claude Opus 4.6

#### Executive Summary

This branch adds cross-platform process signaling for Lima hostagent lifecycle management on Windows, pre-generates SSH keys to work around Lima's broken cygpath conversion, replaces Lima's `StopForcibly` with a custom implementation that handles WSL2 distro cleanup, migrates CI from WSL2 to MSYS2 for BATS tests, and propagates `RDD_LOG_LEVEL` to external controllers. The design is sound and the changes are well-scoped. No critical issues; the main concerns are around observability of edge cases and a testing gap on Windows.

#### Critical Issues

None.

#### Important Issues

1. **Stale comment on `gracefulShutdownTimeout`** — `limavm_controller.go:84-85` (important, gap)

```go
// gracefulShutdownTimeout is the time to wait for a hostagent to exit
// after sending SIGINT before falling back to SIGKILL.
gracefulShutdownTimeout = 30 * time.Second
```

The comment references Unix-specific signal names ("SIGINT", "SIGKILL"), but the code now uses `process.Interrupt` (CTRL_BREAK on Windows, line 496) and `process.KillTree` (taskkill on Windows, line 512). The comment predates this PR (blame: `776b8f77`) but the PR changed the underlying behavior to be cross-platform. Readers will assume Unix-only behavior.

Fix: Update the comment to platform-agnostic language, e.g., "after sending a graceful shutdown signal before falling back to forced termination."

2. **Graceful shutdown test skips core verification on Windows** — `limavm-running.bats:331-336` (important, gap)

```bash
# On Windows, child process handles (wsl.exe) keep the hostagent PID visible
# in the process table long after termination. The next test ("restart service
# after graceful shutdown") verifies the hostagent is functionally dead.
if is_unix; then
    try --max 15 --delay 1 --until-fail -- assert_process_alive "${ha_pid}"
fi
```

The "graceful service stop terminates hostagent" test skips its key assertion on Windows. The follow-up test ("restart service after graceful shutdown") uses `--since=startup` to wait for a fresh Running condition, but this succeeds regardless of whether graceful shutdown worked or orphan recovery kicked in. On Windows, `process.Kill` (used by `rdd svc stop` at `service.go:341`) calls `TerminateProcess`, which bypasses Go's signal handlers entirely — `shutdownAllHostagents` never runs. The hostagent survives as an orphan. The test cannot distinguish "graceful shutdown worked" from "orphan recovery worked."

Fix: Either (a) add a Windows-specific check using `rdd ctl get limavm` to verify the hostagent was stopped by the service (not by orphan recovery), or (b) document the known limitation that `rdd svc stop` on Windows does not perform graceful hostagent shutdown and that orphan recovery is the expected path, then remove the misleading comment "no orphan — graceful shutdown killed it" from the next test.

3. **`setVerbosityFromEnv` silently ignores unrecognized levels** — `pkg/external/main.go:260-268` (important, regression)

```go
func setVerbosityFromEnv() {
    level := strings.TrimSpace(os.Getenv("RDD_LOG_LEVEL"))
    switch strings.ToLower(level) {
    case "trace":
        _ = flag.Set("v", "4")
    case "debug":
        _ = flag.Set("v", "2")
    }
}
```

The RDD CLI (`cmd/rdd/main.go:72-94`) supports six levels: `fatal`, `error`, `warn`, `info`, `debug`, `trace`. External controllers only handle two. Setting `RDD_LOG_LEVEL=info` or `RDD_LOG_LEVEL=warn` is silently ignored. Since klog's verbosity is numeric and can only increase detail, there is no way to map `warn`/`error`/`fatal` to klog's `-v` flag — that's a fundamental limitation. But the silent no-op will confuse operators: they set `RDD_LOG_LEVEL=warn` expecting quieter logs and get full info-level output.

Fix: Add a `default` case that logs when a non-empty, unrecognized level is set:
```go
default:
    if level != "" {
        klog.Infof("RDD_LOG_LEVEL=%q is not supported for klog (only trace, debug); using default verbosity", level)
    }
```

#### Suggestions

1. **`editor_cmd` doesn't quote the converted path** — `limavm-running.bats:88` (suggestion, regression)

```bash
editor_cmd() {
    local script=$1
    if is_msys; then
        echo "bash $(cygpath -m "${script}")"
    else
        echo "${script}"
    fi
}
```

If the MSYS2 temp directory ever contains spaces, `bash C:/path with spaces/script.sh` would break. Currently safe because CI temp paths don't contain spaces, but fragile.

Fix: Quote the path: `echo "bash '$(cygpath -m "${script}")'"`.

2. **`assert_process_alive` could false-positive on Windows** — `os.bash:103` (suggestion, regression)

```bash
MSYS_NO_PATHCONV=1 tasklist.exe /FI "PID eq ${pid}" /NH 2>/dev/null | grep -q "${pid}"
```

The `grep -q "${pid}"` matches the PID anywhere in the tasklist output line. If the PID string appears in an image name or window title, this would false-positive. Unlikely for short numeric PIDs, but brittle.

Fix: Anchor the grep to match a standalone number: `grep -qP "^\S+\s+${pid}\s"` or use a more precise filter.

3. **`ensureSSHKeys` checks only private key existence** — `ssh_keys_windows.go:31-33` (suggestion, regression)

```go
if _, err := os.Stat(privPath); err == nil {
    return nil
}
```

If the private key exists but the public key was deleted or corrupted, the function returns `nil` and Lima will later fail when it tries to use the keypair. Unlikely, but a public key check would make the function more robust.

Fix: Also stat `privPath + ".pub"` before returning early.

4. **`stopInstanceForcibly` file cleanup doesn't log which files it removed** — `limavm_lifecycle.go:559-566` (suggestion, regression)

```go
for _, f := range entries {
    for _, suffix := range filenames.TmpFileSuffixes {
        if strings.HasSuffix(f.Name(), suffix) {
            path := filepath.Join(inst.Dir, f.Name())
            if err := os.Remove(path); err != nil {
                logger.V(1).Info("Failed to remove tmp file", "path", path, "error", err)
            }
        }
    }
}
```

Failures are logged but successes are not. When debugging stale file issues, knowing which files were cleaned up is valuable.

Fix: Add a V(1) log for successful removals, or log the complete list of matched files before the loop.

#### Design Observations

**Strengths**

- The `process` package abstraction (`SetGroup`, `Interrupt`, `KillTree`) cleanly separates platform concerns from the controller logic. The reconciler never mentions signals or Windows APIs — it calls `shutdownHostagent` and the platform layer handles the rest.
- Replacing Lima's `StopForcibly` with `stopInstanceForcibly` is well-motivated. The comment at `limavm_lifecycle.go:521-528` clearly explains why: Lima's `SysKill` on Windows uses `GenerateConsoleCtrlEvent(CTRL_BREAK)` which targets the entire console group, killing the RDD service. The custom implementation targets only the hostagent's process tree and adds WSL2 distro cleanup.
- The `forceStop` closure in `shutdownHostagent` (line 483-496) that lazily resolves `inst` via `store.Inspect` is a clean way to handle both callers that have an instance handle (e.g., `stopInstance` at line 445) and those that don't (e.g., `handleRestartNeeded` at line 289).
- Extracting VM templates into `vm_template.bash` with `RDD_WSL_DISTRO` selection eliminates duplication across test files and makes Windows template selection configurable.
- The MSYS2 migration simplifies CI substantially — the WSL interop layer (binfmt_misc, wslpath, ghost files) was a fragile abstraction that the MSYS2 approach sidesteps entirely.

**Concerns**

- **(in-scope)** On Windows, `rdd svc stop` calls `process.Kill` (`service.go:341`), which calls `TerminateProcess` — an immediate, non-catchable kill. Go's signal handlers never fire, so `shutdownAllHostagents` never runs. Hostagent processes become orphans until the next service start detects them via `handleUnwatchedState`. On Unix, `process.Kill` sends SIGTERM, which triggers graceful shutdown. This asymmetry means the graceful shutdown path (`shutdownAllHostagents` with `process.Interrupt` → `process.KillTree` fallback) added by this PR is exercised on Unix but never on Windows during normal `rdd svc stop`. The orphan recovery path works, but it's a fundamentally different shutdown model. A possible fix: change `process.Kill` on Windows to send `CTRL_BREAK_EVENT` first (now that `SetGroup`/`CREATE_NEW_PROCESS_GROUP` is properly set up), wait for graceful exit, then fall back to `TerminateProcess`.

#### Testing Assessment

Test coverage is good overall. The BATS tests exercise the full lifecycle: create, start, crash recovery, orphan recovery, graceful shutdown, restart, template changes, and editor commands. The Windows adaptations (`assert_process_alive`, `process_kill`, `editor_cmd`, WSL2 distro cleanup in teardown) are sensible.

Untested scenarios, ranked by risk:

1. **Windows graceful shutdown** — as described in finding #2, the test cannot distinguish graceful shutdown from orphan recovery on Windows.
2. **SSH key generation failure on Windows** — no test covers what happens if `ssh-keygen` is not on PATH or fails. The controller returns a fatal error from `SetupWithManager`, but this isn't exercised.
3. **WSL2 distro cleanup timeout** — `stopInstanceForcibly` uses a 10-second timeout for `wsl.exe --terminate`. No test exercises the timeout path (degraded WSL subsystem).
4. **`setVerbosityFromEnv` integration** — no test verifies that external controllers honor `RDD_LOG_LEVEL`. The existing log-level tests in `4-log-level.bats` only cover the RDD CLI.

#### Documentation Assessment

No documentation gaps. The `vm_template.bash` header comment documents `RDD_WSL_DISTRO` usage. The `stopInstanceForcibly` godoc explains the motivation clearly. The `setVerbosityFromEnv` comment explains the flag.Parse ordering.

---

### Codex GPT 5.4

Verdict: Needs fixes before merge.

#### Executive Summary

This branch adds the Windows hostagent/process work needed to run Lima BATS tests under MSYS2 and to stop WSL2-backed instances more reliably. The Windows-specific direction is sound, but the new shared force-stop helper regresses non-Windows lifecycle behavior: orphan recovery now always hard-kills running guests, and forced stops no longer unlock Lima additional disks. The `RDD_LOG_LEVEL` change for external controllers is also only partially implemented.

#### Critical Issues

None.

#### Important Issues

1. **Controller restart now hard-kills orphaned VMs instead of attempting a graceful shutdown** — `pkg/controllers/lima/limavm/controllers/limavm_lifecycle.go:253-257`, `pkg/controllers/lima/limavm/controllers/limavm_lifecycle.go:516-517` `(important, regression)`

```go
case limatype.StatusRunning, limatype.StatusBroken:
    logger.Info("Found orphaned hostagent, killing it", "status", inst.Status)
    if err := r.killOrphanedHostagent(ctx, inst); err != nil {
        logger.Error(err, "Failed to kill orphaned hostagent")
        return ctrl.Result{}, err
    }
```

```go
func (r *LimaVMReconciler) killOrphanedHostagent(ctx context.Context, inst *limatype.Instance) error {
    stopInstanceForcibly(ctx, log.FromContext(ctx), inst)
    return nil
}
```

When the controller restarts and `store.Inspect()` still reports the instance as `Running` or `Broken` at lines 253-257, `killOrphanedHostagent()` now goes straight to `stopInstanceForcibly()` at lines 516-517. There is no graceful path, no timeout, and no wait for the guest to shut down cleanly. In practice, every controller restart with a running orphan now becomes a forced power-off, which is a real regression for guest filesystem integrity and for any workload that expects an orderly shutdown. This path is exercised on normal service crashes/restarts, not just exceptional delete flows.

Fix: restore the old graceful-first behavior for orphan recovery on Unix/macOS, and only use the Windows-specific force-stop workaround when the Windows signaling bug actually applies. If you want one helper, make it `shutdownOrphanedHostagent()` with a graceful phase plus a Windows-only force fallback.

2. **The new force-stop helper dropped Lima's additional-disk unlock step** — `pkg/controllers/lima/limavm/controllers/limavm_lifecycle.go:52`, `pkg/controllers/lima/limavm/controllers/limavm_lifecycle.go:195`, `pkg/controllers/lima/limavm/controllers/limavm_lifecycle.go:495`, `pkg/controllers/lima/limavm/controllers/limavm_lifecycle.go:517`, `pkg/controllers/lima/limavm/controllers/limavm_lifecycle.go:532-565` `(important, regression)`

```go
func stopInstanceForcibly(ctx context.Context, logger logr.Logger, inst *limatype.Instance) {
    for _, pid := range []int{inst.DriverPID, inst.HostAgentPID} {
        if pid > 0 {
            if err := process.KillTree(ctx, pid); err != nil {
                logger.V(1).Info("Failed to kill process tree", "pid", pid, "error", err)
            }
        }
    }
    ...
    entries, err := os.ReadDir(inst.Dir)
```

`stopInstanceForcibly()` at lines 532-565 replaces `limainstance.StopForcibly()` on every force-stop path: deletion at line 52, orphaned-driver cleanup at line 195, graceful-stop fallback at line 495, and orphan recovery at line 517. The new helper kills processes and removes tmp files, but it no longer unlocks `inst.AdditionalDisks` anywhere in lines 532-565. Lima's original force-stop implementation did that explicitly before returning, so this change can leave `in_use_by` locks behind after forced stop/delete and block later reuse of those disks.

Fix: port Lima's additional-disk unlock loop into `stopInstanceForcibly()` before returning, or call the upstream `limainstance.StopForcibly()` on platforms where its signal behavior is safe and keep the custom helper Windows-only.

#### Suggestions

1. **`RDD_LOG_LEVEL` is still ignored for `info`, `warn`, and `error` in external controllers** — `pkg/external/main.go:70`, `pkg/external/main.go:76`, `pkg/external/main.go:113`, `pkg/external/main.go:124`, `pkg/external/main.go:129`, `pkg/external/main.go:178`, `pkg/external/main.go:189`, `pkg/external/main.go:260-267` `(suggestion, gap)`

```go
func setVerbosityFromEnv() {
    level := strings.TrimSpace(os.Getenv("RDD_LOG_LEVEL"))
    switch strings.ToLower(level) {
    case "trace":
        _ = flag.Set("v", "4")
    case "debug":
        _ = flag.Set("v", "2")
    }
}
```

The new mapping at lines 260-267 only handles `trace` and `debug`. For `RDD_LOG_LEVEL=info`, `warn`, or `error`, verbosity stays at the default `0`, so the unconditional `Info` calls at lines 70, 76, 113, 124, 129, 178, and 189 still log. That means the branch only partially honors the environment variable, and the common "quiet down external controllers" settings still do nothing.

Fix: either map the full `logrus` level set to a logger that supports severity filtering, or document and implement an explicit translation for non-debug levels so `warn`/`error` actually suppress `Info` output.

#### Design Observations

**Concerns**

- `(in-scope)` The new `stopInstanceForcibly()` helper is shared across Windows and non-Windows callers, but its reason for existing is Windows-specific. Keeping one replacement widened the blast radius to Unix/macOS and caused both regressions above. A better split is: preserve Lima's existing graceful/force lifecycle on non-Windows, and layer the custom Windows process-tree / WSL2 cleanup only behind a Windows-specific branch.

**Strengths**

- Pre-generating Lima SSH keys during controller setup is a good place to contain the MSYS2/OpenSSH path-conversion workaround, because it avoids scattering Windows-specific keygen logic through the start path.
- Using a fresh context for WSL cleanup during manager shutdown is a solid detail; it avoids inheriting the already-cancelled manager context.

#### Testing Assessment

1. There is no coverage for forced-stop flows with `additionalDisks`, so this branch would not catch leaked disk locks after delete, crash recovery, or restart fallback.
2. There is no test asserting that orphan recovery after a controller restart performs a graceful shutdown on Unix/macOS rather than an immediate hard kill.
3. There is no direct test matrix for external controllers with `RDD_LOG_LEVEL=trace/debug/info/warn/error`, so the partial env-var handling goes unnoticed.

#### Documentation Assessment

No major user-facing docs are obviously required for the internal lifecycle changes, but the new Windows test-mode assumptions would benefit from a short contributor note covering the MSYS2/OpenSSH requirement and the `RDD_WSL_DISTRO` test knob.

---

### Gemini 3.1 Pro

#### Executive Summary

This branch implements Windows-specific process signaling, pre-generates Lima SSH keys to avoid MSYS2 path translation issues, and migrates Windows BATS testing from WSL2 to MSYS2. While the branch successfully addresses the targeted lifecycle issues, it introduces two critical regressions in process termination (`process.KillTree`) that can lead to permanent controller deadlocks and orphaned VM driver processes.

#### Critical Issues

1. **`process.KillTree` failures can cause indefinite hangs during shutdown and reconciliation** — `pkg/controllers/lima/limavm/controllers/limavm_controller.go:508-514`, `pkg/controllers/lima/limavm/controllers/limavm_lifecycle.go:503-504` (critical, regression)

```go
                    graceful = false
                    if state.cmd != nil && state.cmd.Process != nil {
                            _ = process.KillTree(context.Background(), state.cmd.Process.Pid)
                    }
                    <-state.procExited
```
When graceful shutdown times out, the code attempts to forcefully kill the process using `process.KillTree` and then unconditionally blocks on `<-state.procExited` (in `shutdownAllHostagents`) or `r.waitForProcessExit` (in `shutdownHostagent`). If `KillTree` fails to kill the hostagent or any child process holding the stdout/stderr pipes open (e.g. `taskkill` fails on Windows or the child escapes the process group), the `cmd.Wait()` call in the watcher goroutine will never return. Because there is no timeout on `<-state.procExited`, a single failure will permanently hang the daemon shutdown or permanently consume a controller worker thread.

**Fix:** Add a short timeout when waiting for `procExited` after a forceful kill, and fall back to `state.cmd.Process.Kill()` if `KillTree` fails to ensure the parent process is at least terminated.

**Consolidation note:** This finding is a **false positive**. In `shutdownAllHostagents`, the `<-state.procExited` after `KillTree` at line 514 is inside the timeout fallback of a `select` block (lines 505-514) that already timed out on the graceful phase. `KillTree` sends SIGKILL, which cannot be caught — the process must exit, and `cmd.Wait()` will return. On Windows, `KillTree` uses `taskkill /F /T`, which also force-terminates.

2. **`process.KillTree` on Unix fails to terminate driver processes if they are not group leaders** — `pkg/util/process/process_unix.go:21-23`, `pkg/controllers/lima/limavm/controllers/limavm_lifecycle.go:535-537` (critical, regression)

```go
// KillTree sends SIGKILL to the process group led by the given PID.
func KillTree(_ context.Context, pid int) error {
        return unix.Kill(-pid, unix.SIGKILL)
}
```
On Unix, `KillTree` exclusively sends `SIGKILL` to the process group (`-pid`). In `stopInstanceForcibly`, this is called on both `inst.HostAgentPID` and `inst.DriverPID`. If the driver process was spawned by the hostagent without being placed in its own new process group, it will not be a group leader. Calling `unix.Kill(-inst.DriverPID, ...)` will fail with `ESRCH` (No such process), and the driver process will remain running.

**Fix:** In `process_unix.go`, if `unix.Kill(-pid, unix.SIGKILL)` fails with `ESRCH` or `EPERM`, fall back to killing the process directly with `unix.Kill(pid, unix.SIGKILL)`.

**Consolidation note:** This finding is a **false positive**. Lima spawns the driver with `Setpgid: true` (verified at `lima/pkg/driver/qemu/qemu_driver.go:295` via `BackgroundSysProcAttr`), so the driver is a process group leader. `kill(-driverPID, SIGKILL)` works correctly.

#### Important Issues

1. **TOCTOU race and corrupted file risk in `ensureSSHKeys`** — `pkg/controllers/lima/limavm/controllers/ssh_keys_windows.go:37-43` (important, gap)

```go
       cmd := exec.CommandContext(ctx, "ssh-keygen", "-t", "ed25519", "-q", "-N", "",
               "-C", "lima", "-f", privPath)
       if out, err := cmd.CombinedOutput(); err != nil {
```
`ensureSSHKeys` checks for the existence of the SSH key and generates it if missing. If `ssh-keygen` fails midway (e.g., due to an interruption, memory issue, or timeout), it may leave behind a partial or invalid private key or just the public key. Subsequent runs will see that `privPath` exists, skip generation, and Lima will consistently fail to start VMs. Additionally, concurrent execution across controllers could cause multiple `ssh-keygen` instances to write to the same file.

**Fix:** Write the generated keypair to a temporary file (e.g., `privPath + ".tmp"`) and atomically rename it to `privPath` upon success.

2. **Inconsistent wait order in `shutdownHostagent` else block** — `pkg/controllers/lima/limavm/controllers/limavm_lifecycle.go:508-510` (important, gap)

```go
        } else {
                logger.Info("Could not signal hostagent, killing process directly")
                r.killHostagent(name)
                r.waitForProcessExit(ctx, name)
                forceStop()
        }
```
In the fallback branch where `signalHostagent` fails (e.g., signal delivery error), the code kills the hostagent and waits for it to exit before calling `forceStop()`. `forceStop()` is responsible for cleaning up WSL2 distros and tmp files. If the process wait blocks indefinitely (as discussed above), the cleanup is completely skipped.

**Fix:** Ensure `forceStop()` is called to terminate the process tree and clean up resources *before* waiting for the process to exit.

**Consolidation note:** The concern about indefinite blocking is invalid (same false positive as Critical #1 — `killHostagent` sends SIGKILL). However, the observation about call ordering has some merit: if `killHostagent` fails (process already exited), `forceStop()` would still clean up WSL2 distros and tmp files. The current order is defensible — kill the process first, wait for exit, then clean up — but `forceStop()` could be called unconditionally regardless of kill success.

#### Suggestions

1. **Rename `run-in-wsl.cmd` wrapper script to reflect MSYS2** — `.github/workflows/bats.yaml:124-126` (suggestion, clarity)

```yaml
    - name: "Windows: write MSYS2 wrapper script"
      if: runner.os == 'Windows'
      shell: cmd /c copy {0} run-in-wsl.cmd
```
The PR successfully migrates Windows BATS testing from WSL2 to MSYS2, but it still writes the MSYS2 wrapper logic into a file named `run-in-wsl.cmd` and uses `shell: run-in-wsl {0}` throughout the workflow. This is highly confusing for future maintainers.

**Fix:** Rename the file to `run-in-msys.cmd` and update the corresponding `shell` configurations.

2. **Handle errors when setting `klog` verbosity** — `pkg/external/main.go:264-267` (suggestion, correctness)

```go
       case "trace":
               _ = flag.Set("v", "4")
```
While failing to parse the `RDD_LOG_LEVEL` or set the flag is unlikely, explicitly ignoring the error suppresses potential misconfigurations.

**Fix:** Log a warning using `fmt.Fprintf(os.Stderr, ...)` if `flag.Set` returns an error.

#### Design Observations

**Concerns**

- `stopInstanceForcibly` and lifecycle state: The reliance on process handles (`state.cmd`) combined with external state (`store.Inspect()`) leads to complex fallback trees where it is unclear who owns process termination. Currently, the code mixes direct signal delivery with `process.KillTree` at various depths depending on whether a graceful timeout fired. Moving all forceful termination to a single abstraction that robustly tears down the tree via both group and direct PID signals (regardless of origin) would eliminate these edge cases. `(in-scope)`

#### Testing Assessment

The test updates correctly reflect the migration to MSYS2 and adequately cover standard process recovery on Windows. However, there are some gaps in failure simulation:

1. **Hung processes:** Simulating a hung hostagent or driver process that ignores `SIGINT`/`CTRL_BREAK` to verify that the controller correctly falls back to `KillTree` and does not hang the worker thread.
2. **Missing SSH directories:** Generating SSH keys via `ensureSSHKeys` when the parent configuration directory doesn't exist, and verifying the contents are valid.

#### Clarification Q&A

No clarification round was needed — the false positives were resolved through code verification rather than agent follow-up.
