# Deep Review: 20260408-163819-pr-280

| | |
|---|---|
| **Date** | 2026-04-08 16:38 |
| **Repo** | [rancher-sandbox/rancher-desktop-daemon](https://github.com/rancher-sandbox/rancher-desktop-daemon) |
| **Round** | 2 (of PR 280) |
| **Mode** | Standard |
| **Author** | [@jandubois](https://github.com/jandubois) |
| **PR** | [#280](https://github.com/rancher-sandbox/rancher-desktop-daemon/pull/280) — Add host-switch virtual network for WSL2 instances |
| **Branch** | `host-switch` |
| **Commits** | `6eee65c` Add host-switch virtual network for WSL2 instances |
| **Reviewers** | Claude Opus 4.6, Codex GPT 5.4, Gemini 3.1 Pro |
| **Verdict** | **Merge with fixes** — concurrency bugs in the vsock handshake and a missing cleanup call need attention |
| **Wall-clock time** | `24 min 15 s` |

---

## Executive Summary

This PR ports the host-switch virtual network from rancher-desktop's networking module into the LimaVM controller as an in-process goroutine, enabling WSL2 instances to get network connectivity via gvisor-tap-vsock over AF_VSOCK. The implementation is well-structured with clean lifecycle integration and thorough documentation. The main risks are concurrency bugs in the vsock handshake protocol — nested goroutines that can panic on a closed channel, missing read deadlines that leak goroutines, and a `signalListenerReady` call that ignores context — plus one missing `stopHostSwitch` call in a lifecycle branch.

---

## Critical Issues

None.

---

## Important Issues

I1. **Missing `stopHostSwitch` in `handleWatchedState` default case** — `limavm_lifecycle.go:265-274` (important, regression) [Claude Opus 4.6]

```go
default:
    // phase == phaseStopped && !shouldRun
    r.stopWatcher(limaVM.Name)
    if !base.HasConditionWithReason(limaVM.Status.Conditions, ConditionRunning, metav1.ConditionFalse, ReasonStopped) {
```

When the hostagent exits while `shouldRun` is false, the watcher is stopped at line 267 but the host-switch goroutine is not. The host-switch holds a vsock listener, a gvisor-tap-vsock virtual network, and an HTTP server — all permanently leaked. If the VM restarts later, `startHostSwitch` at line 116 overwrites the `hostSwitchStates` entry, making the old goroutine unreachable.

Every other teardown path calls `stopHostSwitch`: `handleDeletion` (line 40), crash recovery in `phaseStopped && shouldRun` (line 213), `handleRestartNeeded` (line 333), `stopInstance` (line 505), and `shutdownAllHostagents` (line 543). This branch is the sole gap.

Fix:

```diff
 default:
     // phase == phaseStopped && !shouldRun
     r.stopWatcher(limaVM.Name)
+    r.stopHostSwitch(limaVM.Name)
     if !base.HasConditionWithReason(limaVM.Status.Conditions, ConditionRunning, metav1.ConditionFalse, ReasonStopped) {
```

I2. **Nested handshake goroutines can panic on closed `found` channel** — `hostswitch_windows.go:315-387` (important, regression) [Codex GPT 5.4]

```go
go func() {
    wg.Wait()
    close(found)
}()
// ...
case <-ticker.C:
    go func() {
        // ...
        select {
        case found <- vmGUID:
        default:
        }
    }()
```

`getVMGUID` closes `found` at line 333 after the per-VM `attemptHandshake` goroutines finish (`wg.Wait()`), but `attemptHandshake` spawns a nested goroutine every second that is not tracked by the WaitGroup. When context cancellation causes `attemptHandshake` to return, `wg.Wait()` completes and `close(found)` fires while nested goroutines are still in-flight. Any goroutine that passed the `ctx.Err()` check at line 363 before cancellation, then completes `readSignature` after `close(found)`, will send on a closed channel and panic.

In Go, sending on a closed channel panics even inside a `select` with a `default` case. The one-second ticker guarantees multiple in-flight goroutines during normal operation, making this race plausible during startup.

Fix: Remove the nested goroutine and perform the dial synchronously within the ticker loop. This limits concurrency to one outstanding probe per VM, which also addresses the goroutine accumulation issue (see I3).

```diff
         case <-ticker.C:
-            go func() {
-                conn, err := getVsockConnection(vmGUID, vsockHandshakePort)
-                if ctx.Err() != nil {
-                    if conn != nil {
-                        conn.Close()
-                    }
-                    return
-                }
+                conn, err := getVsockConnection(vmGUID, vsockHandshakePort)
+                if ctx.Err() != nil {
+                    if conn != nil {
+                        conn.Close()
+                    }
+                    continue
+                }
```

I3. **Missing read deadline on `readSignature` causes goroutine leak** — `hostswitch_windows.go:393-398` (important, gap) [Claude Opus 4.6, Gemini 3.1 Pro]

```go
func readSignature(conn net.Conn) (string, error) {
    buf := make([]byte, len(signaturePhrase))
    if _, err := io.ReadFull(conn, buf); err != nil {
        return "", err
    }
    return string(buf), nil
}
```

`io.ReadFull` at line 395 has no read deadline. If a VM accepts the connection but sends no data (or sends partial data), the goroutine blocks forever. Combined with the nested goroutine spawning in `attemptHandshake`, multiple goroutines can accumulate over the 5-minute handshake timeout — up to 300 per VM GUID.

Fix: Set a read deadline before the read:

```diff
 func readSignature(conn net.Conn) (string, error) {
+    if err := conn.SetReadDeadline(time.Now().Add(5 * time.Second)); err != nil {
+        return "", err
+    }
     buf := make([]byte, len(signaturePhrase))
```

I4. **Host-switch failure has no recovery path** — `hostswitch_windows.go:155-229` (important, gap) [Codex GPT 5.4, Gemini 3.1 Pro]

```go
if err := g.Wait(); err != nil {
    logger.Error(err, "Host-switch exited with error")
} else {
    logger.Info("Host-switch stopped")
}
```

If `runHostSwitch` fails after startup (handshake timeout, virtual network failure, listener error), it logs the error and exits silently. Unlike the hostagent watcher, which enqueues reconciles on state changes, the host-switch has no mechanism to notify the controller. The guest's `network-setup.service` blocks on the handshake during boot, so the VM hangs permanently without triggering a restart.

Fix: When `runHostSwitch` exits unexpectedly (context not cancelled), enqueue a reconcile so the controller can restart the host-switch or the entire instance.

I5. **`signalListenerReady` does not respect context or timeout** — `hostswitch_windows.go:377-384` (important, gap) [Gemini 3.1 Pro]

```go
func signalListenerReady(vmGUID hvsock.GUID) error {
    conn, err := getVsockConnection(vmGUID, vsockHandshakePort)
    if err != nil {
        return err
    }
    defer conn.Close()
    _, err = conn.Write([]byte(readySignal))
    return err
}
```

`signalListenerReady` calls `hvsock.Dial` (via `getVsockConnection` at line 378) without any timeout or context. The `handshakeTimeout` applied to `getVMGUID` at line 268 does not cover this call. If the VM becomes unresponsive between discovery and signaling, `hvsock.Dial` can hang indefinitely. This blocks `runHostSwitch` from exiting, which blocks `stopHostSwitch` on `<-state.done`, which stalls the controller's reconcile loop.

Fix: Pass a context and enforce a timeout:

```diff
-func signalListenerReady(vmGUID hvsock.GUID) error {
-    conn, err := getVsockConnection(vmGUID, vsockHandshakePort)
+func signalListenerReady(ctx context.Context, vmGUID hvsock.GUID) error {
+    type result struct {
+        err error
+    }
+    ch := make(chan result, 1)
+    go func() {
+        conn, err := getVsockConnection(vmGUID, vsockHandshakePort)
+        if err != nil {
+            ch <- result{err}
+            return
+        }
+        defer conn.Close()
+        _, err = conn.Write([]byte(readySignal))
+        ch <- result{err}
+    }()
+    select {
+    case r := <-ch:
+        return r.err
+    case <-ctx.Done():
+        return ctx.Err()
+    }
+}
```

---

## Suggestions

S1. **Defensive double-start guard in `startHostSwitch`** — `hostswitch_windows.go:116-121` (suggestion, gap) [Claude Opus 4.6, Gemini 3.1 Pro]

```go
r.hostSwitchMu.Lock()
r.hostSwitchStates[name] = &hostSwitchState{
    cancel: hsCancel,
    done:   done,
}
r.hostSwitchMu.Unlock()
```

If `startHostSwitch` is called while a host-switch already exists for the same instance, the old entry is silently overwritten, leaking the old goroutine. Current call sites prevent this by calling `stopHostSwitch` first, but a defensive check would catch future regressions.

Fix:

```diff
 r.hostSwitchMu.Lock()
+if old, exists := r.hostSwitchStates[name]; exists {
+    r.hostSwitchMu.Unlock()
+    old.cancel()
+    <-old.done
+    r.hostSwitchMu.Lock()
+}
 r.hostSwitchStates[name] = &hostSwitchState{
```

S2. **Design doc references non-existent `hostswitch.go`** — `docs/design/networking.md:174` (suggestion, regression) [Claude Opus 4.6]

```
| `hostswitch.go` | (none) | Subnet validation, `hostSwitchState` type |
```

The file table describes a platform-independent `hostswitch.go`, but only `hostswitch_windows.go` and `hostswitch_other.go` exist. Either update the table to match reality, or extract `validateSubnet`, `hostSwitchSubnet`, and `newVirtualNetworkConfig` into a platform-independent file (which would also allow the unit tests to run on all platforms).

S3. **Timeout increase from 60s to 150s lacks comment** — `bats/tests/32-app-controllers/app.bats:132` (suggestion, regression) [Claude Opus 4.6]

```bash
rdd ctl wait --for=condition=Created \
    limavm/"${VM_NAME}" --namespace "${RDD_NAMESPACE}" --timeout=150s
```

The `Created` condition timeout was increased 2.5x, presumably because downloading the opensuse distro image is slower than the Finch rootfs. A brief comment would prevent someone from reverting it without understanding the reason.

S4. **Exclude non-matching VMs from further handshake attempts** — `hostswitch_windows.go:357-360` (suggestion, enhancement) [Gemini 3.1 Pro]

When `readSignature` reads a valid but non-matching signature, `attemptHandshake` loops and dials the same VM again one second later. It bombards that VM for the full 5-minute timeout. If the signature does not match, exit the loop — the VM cannot be the target distro.

S5. **Synchronous `AcceptStdio` blocks concurrent connections** — `hostswitch_windows.go:189-193` (suggestion, gap) [Gemini 3.1 Pro]

```go
if err := vn.AcceptStdio(ctx, conn); err != nil {
    logger.Error(err, "Failed to accept connection into virtual network")
} else {
    logger.Info("Accepted vsock connection")
}
```

`AcceptStdio` blocks until the connection closes, so the accept loop can serve only one connection at a time. In the normal case there is exactly one data connection from `vm-switch`, so this works. If `vm-switch` crashes and reconnects before the host detects the disconnect (half-open connection), the new connection waits behind the old one. Processing connections concurrently (`go func(c net.Conn) { vn.AcceptStdio(ctx, c) }(conn)`) would handle this edge case.

---

## Design Observations

**Concerns**

- **Host-switch lifecycle is coupled by convention, not structure** (future) [Claude Opus 4.6] — The host-switch goroutine's context derives from the reconciler context, not from the hostagent's lifecycle. Cleanup depends on every teardown path remembering to call both `stopWatcher` and `stopHostSwitch`. A single missed call leaks the goroutine (I1). Tying the host-switch context to the watcher's context would provide structural safety instead of relying on convention.

- **Host-switch introduces a second unmonitored long-lived component** (in-scope) [Codex GPT 5.4] — The hostagent has a watcher that enqueues reconciles on state changes. The host-switch has no equivalent. Extending the existing per-instance state machine to include host-switch health (I4) would remove the need for bespoke recovery logic.

**Strengths**

- Clean platform abstraction via embedded `hostSwitchPlatform` struct with no-op stubs for non-Windows. The build-tag split is minimal and idiomatic. [Claude Opus 4.6, Codex GPT 5.4]
- Correct lifecycle ordering: the host-switch starts before the hostagent (matching the guest's boot dependency) and stops after it. [Claude Opus 4.6, Codex GPT 5.4]
- The design doc (`networking.md`) is thorough — clear sequence diagrams, address table, and lifecycle documentation that will pay dividends for future maintainers. [Claude Opus 4.6, Gemini 3.1 Pro]

---

## Testing Assessment

1. **No test for the closed-channel panic (I2)** — Multiple in-flight probes to the same VM, followed by cancellation, can trigger a panic. No test exercises this race.
2. **No integration test for host-switch lifecycle** — The unit tests cover `validateSubnet` and `newVirtualNetworkConfig`, but the goroutine lifecycle (`startHostSwitch`/`stopHostSwitch` with the mutex and map) has no coverage.
3. **Unit tests are Windows-only** — `validateSubnet` and `newVirtualNetworkConfig` contain no Windows-specific logic, but because they live in `hostswitch_windows.go`, the tests only run on Windows. Extracting them to a platform-independent file (S2) would enable cross-platform testing.
4. **No end-to-end network verification** — The BATS timeout increase confirms the VM needs more time to boot, but no test verifies that the network actually works (DHCP assignment, DNS resolution).

---

## Documentation Assessment

- `docs/design/networking.md` is well-written and comprehensive. The file table at line 174 references a non-existent `hostswitch.go` (S2). The dependency section states the packages are transitive via Lima, but `go.mod` now lists them as direct requirements.
- `docs/design/api_lima.md` correctly links to the new networking doc under the `LimaNetwork` section.

---

## Acknowledged Limitations

- **Sequential hostagent shutdown** — `limavm_controller.go:508` describes a TODO to wait on hostagents in parallel. The new `stopHostSwitch` call at line 543 adds to per-instance shutdown time, slightly worsening this existing limitation.
- **Windows signal support** — The project memory notes that `os.Signal` sending is not implemented on Windows. The host-switch uses context cancellation instead of signals, so this does not affect the new code.

---

## Unresolved Feedback

Both inline comments from Nino-K have been addressed in the current revision:

1. **`return` after `vn.Listen` failure skips `g.Wait()`** ([comment](https://github.com/rancher-sandbox/rancher-desktop-daemon/pull/280#discussion_r3054525099)) — The code was restructured so that the API listener setup now occurs before the errgroup is created (lines 170-182). The errgroup starts at line 183, after the early return. The concern no longer applies.

2. **Missing `ln.Close()` before `return` after `virtualnetwork.New` failure** ([comment](https://github.com/rancher-sandbox/rancher-desktop-daemon/pull/280#discussion_r3054530759)) — The current code at line 163 calls `ln.Close()` before returning. Addressed.

---

## Agent Performance Retro

### [Claude]

Claude produced the most thorough lifecycle analysis, tracing every `stopHostSwitch` call site with `git blame` (8 blame calls) to identify the sole missing cleanup in I1. It also identified the read-deadline gap in `readSignature` (I3) and several documentation issues (S2, S3). Claude's findings were accurate with no false positives.

**Tool call highlights:** Claude ran 8 `git blame` commands covering all lifecycle paths — the most blame-intensive of the three agents. It read the lifecycle file 7 times at different offsets, demonstrating systematic tracing. 35 total tool calls with a good mix of Read (20), Bash (9), Grep (5), and Glob (1).

### [Codex]

Codex identified the most subtle bug — the closed-channel panic in I2. This required tracing the interaction between `getVMGUID`'s WaitGroup/close pattern and `attemptHandshake`'s nested goroutines, a cross-function concurrency analysis that neither Claude nor Gemini performed. Codex also identified the host-switch recovery gap (I4) and traced into the hostagent watcher to contrast its reconcile-on-exit pattern.

Codex rated I2 as CRITICAL. The severity was downgraded to IMPORTANT because the panic requires specific timing — a nested goroutine must pass the `ctx.Err()` check before cancellation, then complete after `close(found)`. However, this is the most important finding in the review: a real panic that would crash the controller.

**Tool call highlights:** Codex used 47 tool calls, the most of the three. It traced into dependency source code (gvisor-tap-vsock's `stdio.go`, `gvproxy/main.go`, and virtsock's `hvsock_windows.go`) to verify `AcceptStdio` blocking behavior and `hvsock.Dial` timeout semantics. Also ran `go test` to verify unit tests pass. 6 `git blame` calls.

### [Gemini]

Gemini identified four findings in the host-switch protocol code: the `signalListenerReady` timeout gap (I5), the unbounded dial concurrency (contributing to I3), the synchronous `AcceptStdio` concern (S5), and the `readSignature` deadline gap (I3, shared with Claude). It also traced into gvisor-tap-vsock's source (11 shell commands exploring dependency code).

Gemini rated the `signalListenerReady` finding as CRITICAL. The severity was downgraded to IMPORTANT because the scenario (VM unresponsive between successful discovery and ready signal) is low probability.

Gemini did not run `git blame`, consistent with the known behavior that its daily Pro quota is too low for blame calls. This means it could not distinguish regressions from pre-existing issues — all findings were classified as "gap" rather than "regression."

**Tool call highlights:** 21 total tool calls — the fewest, but focused on dependency analysis. 5 `read_file` calls, 5 `grep_search` calls, 11 shell commands. Gemini explored `AcceptStdio` source code in depth (5 shell commands on gvisor-tap-vsock internals), leading to the S5 finding that other agents missed.

### Summary

| | Claude Opus 4.6 | Codex GPT 5.4 | Gemini 3.1 Pro |
|---|---|---|---|
| Duration | 331s | 289s | 261s |
| Findings | 0C 2I 4S | 0C 2I 0S | 0C 4I 2S |
| Tool calls | 35 (Read 20, Bash 9) | 47 (exec_command 45) | 21 (shell 11, read 5) |
| Design observations | 1 | 1 | 1 |
| False positives | 0 | 0 | 0 |
| Unique insights | I1, S2, S3 | I2 | I5, S4, S5 |
| Files reviewed | 10/10 | 10/10 | 10/10 |
| Coverage misses | 0 | 0 | 0 |

**Reconciliation:**

- Codex P1 closed-channel panic: CRITICAL → IMPORTANT I2 (timing-dependent race, controller restarts on panic)
- Gemini P1 `signalListenerReady` timeout: CRITICAL → IMPORTANT I5 (low-probability scenario — VM was responsive moments before)
- Gemini P1 synchronous `AcceptStdio`: IMPORTANT → SUGGESTION S5 (single-connection is the normal case; reconnect works once broken connection detected)

All three agents achieved full coverage with no false positives. Codex provided the deepest concurrency analysis (I2). Claude provided the most thorough lifecycle tracing (I1). Gemini provided the broadest protocol analysis (I5, S4, S5) by reading dependency source code.

---

## Review Process Notes

No process improvements identified.

---

## Appendix: Original Reviews

### [Claude] — Pass 1

Now I have all the context needed. Let me write the review.

---

## Executive Summary

This PR ports host-switch virtual networking into the LimaVM controller as an in-process goroutine, enabling WSL2 instances to get network connectivity via gvisor-tap-vsock over AF_VSOCK. The implementation is well-structured with clean lifecycle integration. Two issues need attention: a missing `stopHostSwitch` call in one lifecycle branch leaks the goroutine, and the handshake dial goroutines can leak due to missing read deadlines.

## Findings

### Critical Issues

None.

### Important Issues

I1. **Missing `stopHostSwitch` in `handleWatchedState` default case** — `limavm_lifecycle.go:265-274` (important, regression)

```go
default:
    // phase == phaseStopped && !shouldRun
    r.stopWatcher(limaVM.Name)
    if !base.HasConditionWithReason(limaVM.Status.Conditions, ConditionRunning, metav1.ConditionFalse, ReasonStopped) {
```

When the hostagent crashes (`phaseStopped`) while `shouldRun` is false, the watcher is stopped but the host-switch goroutine is not. The host-switch was started in `startInstance` at line 447 and remains running — holding a vsock listener, a gvisor-tap-vsock virtual network, and an HTTP server — with no way to reach it. If the VM is started again later, `startHostSwitch` (line 116-121) overwrites the `hostSwitchStates` entry, permanently leaking the old goroutine's resources.

Every other branch that tears down the hostagent also tears down the host-switch: `handleDeletion` (line 40), crash recovery in `phaseStopped && shouldRun` (line 213), `handleRestartNeeded` (line 333), `stopInstance` (line 505), `shutdownAllHostagents` (line 543). This branch is the sole gap.

Fix:

```diff
 default:
     // phase == phaseStopped && !shouldRun
     r.stopWatcher(limaVM.Name)
+    r.stopHostSwitch(limaVM.Name)
     if !base.HasConditionWithReason(limaVM.Status.Conditions, ConditionRunning, metav1.ConditionFalse, ReasonStopped) {
```

I2. **Goroutine leak in `attemptHandshake`: no read deadline on vsock connections** — `hostswitch_windows.go:361-387` (important, regression)

```go
go func() {
    conn, err := getVsockConnection(vmGUID, vsockHandshakePort)
    if ctx.Err() != nil {
        if conn != nil {
            conn.Close()
        }
        return
    }
    if err != nil {
        // ...
        return
    }

    sig, err := readSignature(conn)
    conn.Close()
```

Each ticker tick at line 359 spawns a goroutine that dials the vsock, checks `ctx.Err()` once at line 363, then calls `readSignature` at line 374. The `readSignature` function (line 393-398) calls `io.ReadFull` with no read deadline. If a VM accepts the connection but never sends data (or sends partial data), the goroutine blocks forever on `io.ReadFull`.

When `getVMGUID` finds a match and calls `cancel()` at line 313, in-flight goroutines that already passed the `ctx.Err()` check at line 363 continue into `readSignature` and can hang indefinitely. Since `attemptHandshake` spawns one goroutine per second per VM GUID, multiple goroutines can accumulate during the 5-minute handshake timeout.

Fix: Set a read deadline before calling `readSignature`:

```diff
+               conn.SetReadDeadline(time.Now().Add(5 * time.Second))
                sig, err := readSignature(conn)
                conn.Close()
```

### Suggestions

S1. **`startHostSwitch` doesn't guard against double-start** — `hostswitch_windows.go:116-121` (suggestion, gap)

```go
r.hostSwitchMu.Lock()
r.hostSwitchStates[name] = &hostSwitchState{
    cancel: hsCancel,
    done:   done,
}
r.hostSwitchMu.Unlock()
```

If `startHostSwitch` is called while a host-switch already exists for the same instance, the old entry (with its cancel function and done channel) is silently overwritten. The old goroutine becomes unreachable. Current call sites prevent this by calling `stopHostSwitch` first, but a defensive check would catch future regressions early.

Fix:

```diff
 r.hostSwitchMu.Lock()
+if old, exists := r.hostSwitchStates[name]; exists {
+    r.hostSwitchMu.Unlock()
+    old.cancel()
+    <-old.done
+    r.hostSwitchMu.Lock()
+}
 r.hostSwitchStates[name] = &hostSwitchState{
```

S2. **Design doc references non-existent `hostswitch.go`** — `docs/design/networking.md:174` (suggestion, regression)

```
| `hostswitch.go` | (none) | Subnet validation, `hostSwitchState` type |
```

The table describes a platform-independent `hostswitch.go` file for subnet validation and the `hostSwitchState` type, but this file does not exist. Both are defined in `hostswitch_windows.go` (with `//go:build windows`). The `hostSwitchSubnet` type and `validateSubnet` function could be moved to a platform-independent file to match the doc and enable cross-platform unit tests, but as-is the doc is inaccurate.

Fix: Either update the table to reflect reality (subnet validation is in `hostswitch_windows.go`), or extract `hostSwitchSubnet`, `validateSubnet`, and `newVirtualNetworkConfig` into a platform-independent `hostswitch.go` and move the corresponding tests out of `hostswitch_windows_test.go`.

S3. **Timeout increase from 60s to 150s lacks explanatory comment** — `bats/tests/32-app-controllers/app.bats:132` (suggestion, regression)

```bash
    rdd ctl wait --for=condition=Created \
        limavm/"${VM_NAME}" --namespace "${RDD_NAMESPACE}" --timeout=150s
```

The timeout for the `Created` condition wait was increased from 60s to 150s. This is presumably because downloading the opensuse distro image is slower than the Finch rootfs, but there's no comment explaining the increase or what drives the 150s value. A comment would prevent someone from "fixing" it back to 60s.

S4. **`readSignature` reads exactly `len(signaturePhrase)` bytes with no length prefix or delimiter** — `hostswitch_windows.go:393-398` (suggestion, gap)

```go
func readSignature(conn net.Conn) (string, error) {
	buf := make([]byte, len(signaturePhrase))
	if _, err := io.ReadFull(conn, buf); err != nil {
		return "", err
	}
	return string(buf), nil
}
```

The function assumes the peer sends exactly `len(signaturePhrase)` bytes as its first message. If a future protocol revision changes the signature length, `readSignature` would silently read a truncated or extended prefix. This is acceptable for a fixed protocol constant but worth noting: any change to `signaturePhrase` on one side will silently break the handshake rather than producing a clear error.

## Design Observations

**Concerns**

- The host-switch goroutine's context is derived from the reconciler context (the manager context), not from the hostagent's lifecycle. This means the host-switch outlives the hostagent unless explicitly stopped. The lifecycle coupling is maintained by convention (every stop path calls `stopHostSwitch`), but a single missed call site leaks the goroutine (as demonstrated by I1). An alternative would be to tie the host-switch context to the watcher's context, so stopping the watcher automatically stops the host-switch. This would provide structural safety instead of relying on every code path remembering to call both `stopWatcher` and `stopHostSwitch`. (future)

**Strengths**

- Clean platform abstraction via embedded `hostSwitchPlatform` struct with no-op stubs for non-Windows. The build-tag split is minimal and idiomatic.
- Thorough lifecycle integration: `startHostSwitch` is called before the hostagent (matching the guest's boot dependency), and cleanup calls are placed in every error path of `startInstance`.
- The design doc is excellent — clear sequence diagrams, address table, and lifecycle integration documentation that will pay dividends for future maintainers.

## Testing Assessment

1. **No integration test for host-switch lifecycle** — The unit tests cover `validateSubnet` and `newVirtualNetworkConfig`, but there's no test for the goroutine lifecycle (start, stop, double-stop idempotency). Testing the full vsock handshake requires a Windows+WSL2 environment, but the state management (`startHostSwitch`/`stopHostSwitch` with the mutex and map) could be tested on any platform with a mock or by extracting the state management.
2. **No test for the `phaseStopped && !shouldRun` cleanup path** — The gap identified in I1 would benefit from a test that verifies all resources are cleaned up when the hostagent crashes and the user doesn't want it restarted.
3. **Unit tests are Windows-only** — `validateSubnet` and `newVirtualNetworkConfig` contain no Windows-specific logic, but because they're in `hostswitch_windows.go`, the tests can only run on Windows. See S2.

## Documentation Assessment

- The new `docs/design/networking.md` is comprehensive and well-structured. The file table at line 172-176 references a non-existent `hostswitch.go` (see S2).
- The `api_lima.md` change correctly links to the new networking doc under the `LimaNetwork` section.

## Acknowledged Limitations

- **Sequential hostagent shutdown** — `limavm_controller.go:508` has a TODO: "Wait on all hostagents in parallel instead of sequentially." The new `stopHostSwitch` call at line 543 adds to the per-instance shutdown time (blocked on host-switch goroutine exit). This slightly worsens the sequential-wait problem but is consistent with the existing pattern.
- **Windows signal support** — The memory notes that `os.Signal` sending is not implemented on Windows. The host-switch itself doesn't use signals (it uses context cancellation), so this limitation doesn't affect the new code.

## Unresolved Feedback

1. **Nino-K on `hostswitch_windows.go` — `return` after `vn.Listen` failure skips `g.Wait()`**
   URL: https://github.com/rancher-sandbox/rancher-desktop-daemon/pull/280#discussion_r3054525099

   The current code at line 172-176 returns before `g` (the errgroup) is created at line 183, so `g.Wait()` is irrelevant at that point. However, the comment may have referred to an earlier revision where the errgroup was created first. The code is now correct — `ln.Close()` is called before the return — but the reviewer's concern has not been explicitly acknowledged in the PR.

2. **Nino-K on `hostswitch_windows.go` — missing `ln.Close()` before `return` after `virtualnetwork.New` failure**
   URL: https://github.com/rancher-sandbox/rancher-desktop-daemon/pull/280#discussion_r3054530759

   The current code at line 163-167 does call `ln.Close()` before returning. This appears to be addressed.

## Coverage Summary

| File | Status |
|------|--------|
| `bats/tests/32-app-controllers/app.bats` | **S3** |
| `docs/design/api_lima.md` | Reviewed, no issues |
| `docs/design/networking.md` | **S2** |
| `go.mod` | Trivial (indirect → direct for already-present deps) |
| `pkg/controllers/app/app/lima-images-wsl2.yaml` | Reviewed, no issues |
| `pkg/controllers/lima/limavm/controllers/hostswitch_other.go` | Reviewed, no issues |
| `pkg/controllers/lima/limavm/controllers/hostswitch_windows.go` | **I1, I2, S1, S4** |
| `pkg/controllers/lima/limavm/controllers/hostswitch_windows_test.go` | Reviewed, no issues |
| `pkg/controllers/lima/limavm/controllers/limavm_controller.go` | Reviewed, no issues |
| `pkg/controllers/lima/limavm/controllers/limavm_lifecycle.go` | **I1** |

### [Codex] — Pass 1

### Executive Summary

This change adds an in-process Windows host-switch for WSL2 Lima VMs, wires it into the LimaVM lifecycle, and switches the WSL2 image to the new openSUSE rootfs. The main correctness risk is in the new AF_VSOCK handshake code: its nested goroutine structure can panic on a closed channel during normal startup, and the long-running host-switch has no recovery path if its listener/API loop exits unexpectedly.

### Findings

**Critical Issues**

C1. **Nested handshake probes can panic after `found` is closed** — `pkg/controllers/lima/limavm/controllers/hostswitch_windows.go:315-335,360-385` (critical, regression)

```go
go func() {
    wg.Wait()
    close(found)
}()

case <-ticker.C:
    go func() {
        ...
        select {
        case found <- vmGUID:
```

`getVMGUID()` closes `found` at lines 333-335 as soon as the outer per-VM poller goroutines from lines 324-328 exit, but `attemptHandshake()` at lines 351-387 spawns a second layer of goroutines every second and never waits for them. That means cancellation can close `found` while older probe goroutines are still blocked in `getVsockConnection()` or `readSignature()`, and any later match at lines 380-385 will send on a closed channel and crash the controller. This is not just theoretical: the same VM is polled repeatedly, so once the guest service finally comes up there can be multiple in-flight probes to the matching VM.

Fix: remove the inner `go func()` entirely so each VM has at most one outstanding probe, or track every spawned probe in the `WaitGroup` and do not close `found` until all probe goroutines have finished. The simpler fix is one goroutine per VM with a blocking dial/read loop.

```diff
-        case <-ticker.C:
-            go func() {
-                conn, err := getVsockConnection(vmGUID, vsockHandshakePort)
-                ...
-                select {
-                case found <- vmGUID:
-                default:
-                }
-            }()
+        case <-ticker.C:
+            conn, err := getVsockConnection(vmGUID, vsockHandshakePort)
+            ...
+            select {
+            case found <- vmGUID:
+                return
+            case <-ctx.Done():
+                return
+            default:
+            }
```

**Important Issues**

I1. **Unexpected host-switch failures leave the VM running without any restart path** — `pkg/controllers/lima/limavm/controllers/hostswitch_windows.go:186-229` (important, regression)

```go
conn, err := ln.Accept()
if err != nil {
    // Listener closed (context cancelled).
    return nil
}
...
if err := g.Wait(); err != nil {
    logger.Error(err, "Host-switch exited with error")
}
```

`runHostSwitch()` treats every `ln.Accept()` error at lines 188-191 as an expected shutdown, and even true fatal errors from `s.Serve()` at lines 217-227 are only logged before the goroutine exits. Unlike the hostagent watcher, which explicitly enqueues reconciles on state changes in `hostagent_watcher.go:109`, `hostagent_watcher.go:116`, `hostagent_watcher.go:124`, and `hostagent_watcher.go:142`, the new host-switch path never calls `enqueueReconcile`, and `hostSwitchStates` at lines 116-140 only supports manual cancellation. After `startInstance()` launches the goroutine in `limavm_lifecycle.go:444-447`, there is no code path that notices an unexpected host-switch exit while the hostagent stays up, so the guest can lose DHCP/DNS/NAT permanently until the VM is manually restarted.

Fix: distinguish `ctx.Done()`/`net.ErrClosed` from real listener failures, record host-switch failure in controller state, and enqueue a reconcile when `runHostSwitch()` exits unexpectedly so the reconciler can restart the network stack or restart the instance.

```diff
-            if err != nil {
-                return nil
-            }
+            if err != nil {
+                if ctx.Err() != nil || errors.Is(err, net.ErrClosed) {
+                    return nil
+                }
+                return fmt.Errorf("vsock accept failed: %w", err)
+            }
```

**Suggestions**

None.

### Design Observations

**Concerns**

1. `(in-scope)` The hostagent path already has a watcher/state/reconcile loop, but the host-switch introduces a second long-lived component with only ad hoc stop/wait semantics. Extending the existing per-instance state machine to include host-switch health would remove the need for bespoke recovery logic and would have prevented I1.

**Strengths**

1. The lifecycle ordering is placed in the right layer: starting the host-switch before the hostagent in `startInstance()` and stopping it after watcher/hostagent teardown keeps the controller responsible for process coordination instead of pushing it into the guest.
2. The `hostswitch_other.go` stub keeps the Windows-only behavior isolated cleanly behind build tags.

### Testing Assessment

1. There is no test for the handshake race in C1: multiple in-flight probes to the same VM, followed by cancellation and channel closure.
2. There is no test that injects a host-switch listener/API failure and verifies that the controller requeues and recovers instead of leaving networking dead.
3. The only PR test change is a longer `Created` timeout in BATS; there is still no Windows integration coverage proving the new openSUSE WSL2 image boots and completes the host-switch handshake.

### Documentation Assessment

`docs/design/networking.md` needs a cleanup pass. The implementation table names `hostswitch.go`, but the repo only contains `hostswitch_windows.go`, `hostswitch_other.go`, and `hostswitch_windows_test.go`; the dependency section also says the new packages are only transitive via Lima, but `go.mod` now adds them as direct requirements. The crash-recovery section documents hostagent-triggered recovery, but it does not mention the missing host-switch recovery path described in I1.

### Coverage Summary

- `bats/tests/32-app-controllers/app.bats` — Reviewed, no issues
- `docs/design/api_lima.md` — Reviewed, no issues
- `docs/design/networking.md` — Reviewed, no issues
- `go.mod` — Reviewed, no issues
- `pkg/controllers/app/app/lima-images-wsl2.yaml` — Reviewed, no issues
- `pkg/controllers/lima/limavm/controllers/hostswitch_other.go` — Reviewed, no issues
- `pkg/controllers/lima/limavm/controllers/hostswitch_windows.go` — C1, I1
- `pkg/controllers/lima/limavm/controllers/hostswitch_windows_test.go` — Reviewed, no issues
- `pkg/controllers/lima/limavm/controllers/limavm_controller.go` — Reviewed, no issues
- `pkg/controllers/lima/limavm/controllers/limavm_lifecycle.go` — I1

### [Gemini] — Pass 1

### Executive Summary

This PR successfully implements the host-switch virtual network necessary for WSL2 instances running the opensuse distro to gain network connectivity over AF_VSOCK. The integration with the `LimaVM` controller lifecycle is structurally sound, and PR comments regarding early returns and listener closures have been addressed. However, there are critical concurrency flaws in the vsock handshake protocol that can cause the reconciler to hang indefinitely, along with goroutine leaks and dropped errors that must be resolved before merging.

### Findings

#### Critical Issues

C1. **Synchronous dial in handshake ignores context, causing indefinite reconciler hang** — `pkg/controllers/lima/limavm/controllers/hostswitch_windows.go:377-384`

```go
func signalListenerReady(vmGUID hvsock.GUID) error {
	conn, err := getVsockConnection(vmGUID, vsockHandshakePort)
	if err != nil {
		return err
	}
// ...
```

The second phase of the vsock handshake synchronously calls `getVsockConnection` (which invokes `hvsock.Dial`). Unlike phase 1 (`attemptHandshake`), this call does not respect the `hsCtx` timeout. If the guest is unresponsive or `hvsock.Dial` hangs indefinitely, `signalListenerReady` will never return. This blocks the `runHostSwitch` goroutine from exiting, which subsequently causes `stopHostSwitch` to block forever on `<-state.done`. Because `stopHostSwitch` is called during controller reconciliations (e.g., stopping/restarting an instance), this single hanging dial will permanently stall the controller-runtime worker queue for the `LimaVM` object.

Fix: Pass the context and enforce the timeout by detaching the potentially hanging dial into a goroutine.

```diff
-func signalListenerReady(vmGUID hvsock.GUID) error {
-	conn, err := getVsockConnection(vmGUID, vsockHandshakePort)
-	if err != nil {
-		return err
-	}
-	defer conn.Close()
-	_, err = conn.Write([]byte(readySignal))
-	return err
-}
+func signalListenerReady(ctx context.Context, vmGUID hvsock.GUID) error {
+	errCh := make(chan error, 1)
+	go func() {
+		conn, err := getVsockConnection(vmGUID, vsockHandshakePort)
+		if err != nil {
+			errCh <- err
+			return
+		}
+		defer conn.Close()
+		_, err = conn.Write([]byte(readySignal))
+		errCh <- err
+	}()
+
+	select {
+	case err := <-errCh:
+		return err
+	case <-ctx.Done():
+		return ctx.Err()
+	}
+}
```
*(You will also need to update the call site at line 290 to pass `hsCtx`: `signalListenerReady(hsCtx, vmGUID)`)*
*(critical, gap)*

#### Important Issues

I1. **Unbounded dial loop causes goroutine leakage** — `pkg/controllers/lima/limavm/controllers/hostswitch_windows.go:343-348`

```go
		case <-ticker.C:
			// Dial in a goroutine to avoid blocking on hvsock.Dial timeout.
			go func() {
				conn, err := getVsockConnection(vmGUID, vsockHandshakePort)
				if ctx.Err() != nil {
```

`attemptHandshake` blindly spawns a new dial goroutine every second. Because `hvsock.Dial` does not receive a context and can block for extended periods, these goroutines will accumulate if the VM is unresponsive or slow. Over the 5-minute `handshakeTimeout`, this can spawn up to 300 leaked, concurrent dials per running Hyper-V VM, overwhelming the vsock driver.

Fix: Use a non-blocking channel or a concurrency limit to ensure a new dial is only spawned if the previous one for that VM has finished.

```diff
 func attemptHandshake(ctx context.Context, logger logr.Logger, vmGUID hvsock.GUID, found chan<- hvsock.GUID) {
 	ticker := time.NewTicker(time.Second)
 	defer ticker.Stop()
+	dialing := make(chan struct{}, 1)
 
 	for {
 		select {
 		case <-ctx.Done():
 			return
 		case <-ticker.C:
-			// Dial in a goroutine to avoid blocking on hvsock.Dial timeout.
-			go func() {
-				conn, err := getVsockConnection(vmGUID, vsockHandshakePort)
+			select {
+			case dialing <- struct{}{}:
+				go func() {
+					defer func() { <-dialing }()
+					conn, err := getVsockConnection(vmGUID, vsockHandshakePort)
```
*(important, gap)*

I2. **Synchronous Accept blocks concurrent vsock connections** — `pkg/controllers/lima/limavm/controllers/hostswitch_windows.go:189-193`

```go
			if err := vn.AcceptStdio(ctx, conn); err != nil {
				logger.Error(err, "Failed to accept connection into virtual network")
			} else {
				logger.Info("Accepted vsock connection")
			}
```

The connection handling loop processes vsock connections synchronously. `gvisor-tap-vsock`'s `AcceptStdio` blocks until the connection is fully closed by the peer. Consequently, the host-switch can only handle exactly one vsock data connection at a time. If the guest's `vm-switch` crashes and reconnects before the host detects the disconnect (half-open connection), the new connection will block indefinitely in `ln.Accept()`.

Fix: Process accepted connections concurrently.

```diff
-			if err := vn.AcceptStdio(ctx, conn); err != nil {
-				logger.Error(err, "Failed to accept connection into virtual network")
-			} else {
-				logger.Info("Accepted vsock connection")
-			}
+			go func(c net.Conn) {
+				if err := vn.AcceptStdio(ctx, c); err != nil {
+					logger.Error(err, "Failed to accept connection into virtual network")
+				} else {
+					logger.Info("Accepted vsock connection")
+				}
+			}(conn)
```
*(important, gap)*

I3. **Host-switch failure leaves VM permanently stuck** — `pkg/controllers/lima/limavm/controllers/hostswitch_windows.go:155-159`

```go
	ln, err := vsockHandshake(ctx, logger)
	if err != nil {
		logger.Error(err, "Vsock handshake failed")
		return
	}
```

If `vsockHandshake` times out (or `virtualnetwork.New` fails), `runHostSwitch` logs an error and quietly exits. The controller is completely unaware that the background network provisioning failed. Because the guest's `network-setup.service` blocks on this handshake during early boot, the VM will hang permanently without triggering a controller reconcile or restart.

Fix: When `runHostSwitch` encounters a terminal failure, it must propagate this state to the controller so it can self-heal (e.g., by cancelling the hostagent watcher, which will trigger a `phaseStopped` reconcile and force a retry, or by updating the `LimaVM` Status directly).
*(important, gap)*

I4. **Indefinite read block during signature validation** — `pkg/controllers/lima/limavm/controllers/hostswitch_windows.go:367-374`

```go
func readSignature(conn net.Conn) (string, error) {
	buf := make([]byte, len(signaturePhrase))
	if _, err := io.ReadFull(conn, buf); err != nil {
```

`io.ReadFull` is called without establishing a read deadline on the connection. If the dial succeeds against an arbitrary VM process that accepts connections but sends no data, the read will block forever, permanently leaking the goroutine in `attemptHandshake`.

Fix: Set a read deadline before attempting to read the signature.
```diff
 func readSignature(conn net.Conn) (string, error) {
+	if err := conn.SetReadDeadline(time.Now().Add(5 * time.Second)); err != nil {
+		return "", err
+	}
 	buf := make([]byte, len(signaturePhrase))
```
*(important, gap)*

#### Suggestions

S1. **Exclude non-matching VMs from further handshake attempts** — `pkg/controllers/lima/limavm/controllers/hostswitch_windows.go:357-360`

If `readSignature` successfully reads a signature but it does not match `signaturePhrase`, `attemptHandshake` currently loops around and dials the identical incorrect VM again 1 second later. It will bombard that VM with connections for the full 5 minutes. Consider exiting the `attemptHandshake` goroutine entirely if a valid but non-matching signature is received, as we know with certainty it is not our target distro.

S2. **Defensive cancellation in `startHostSwitch`** — `pkg/controllers/lima/limavm/controllers/hostswitch_windows.go:114-118`

```go
	r.hostSwitchMu.Lock()
	r.hostSwitchStates[name] = &hostSwitchState{
```

If a state entry already exists for `name`, it is silently overwritten, leaking the old context cancel function. While `limavm_lifecycle.go` properly pairs stop/start calls today, it is safer to defensively invoke `r.stopHostSwitch(name)` prior to acquiring the lock and overwriting the state, ensuring robustness against future calling pattern changes.

### Design Observations

**Concerns**
- **Hardcoded Virtual Subnet Constraints (future):** The `192.168.127.0/24` subnet is hardcoded. While this traffic is entirely virtualized by gvisor, if the Windows host resides on a physical `192.168.127.0/24` network, the WSL VM will be unable to route traffic to actual machines on the physical LAN (as gvisor will intercept it locally). Allowing users to configure this subnet in the `LimaVM` spec would prevent routing collisions in enterprise environments.

### Testing Assessment

1. **Virtual Network Flow Testing:** The test suite correctly identifies that `app.bats` times out waiting for the `Created` condition without the network (hence the bump to 150s). However, there are no tests that empirically verify the network is functional once created. It is highly recommended to add an integration test that runs a command inside the VM (e.g., via `lima`) to assert that the `.2` static DHCP IP was properly assigned to the TAP interface and that DNS resolution for `host.rancher-desktop.internal` works as designed.

### Documentation Assessment

The newly added `docs/design/networking.md` is exceptionally well-written. It thoroughly outlines the architecture, components, and topology assumptions, providing superb context for maintainers.

### Acknowledged Limitations

No TODO limitations were promoted to findings. The PID recycling limitation noted in `limavm_lifecycle.go` is unaffected by this change.

### Unresolved Feedback

Both comments identified in the context (returning without closing the listener, and skipping `g.Wait()`) have been correctly addressed in the final submitted code of this PR.

### Coverage Summary

- `bats/tests/32-app-controllers/app.bats` — Trivial
- `docs/design/api_lima.md` — Trivial
- `docs/design/networking.md` — Reviewed, no issues
- `go.mod` — Trivial
- `pkg/controllers/app/app/lima-images-wsl2.yaml` — Trivial
- `pkg/controllers/lima/limavm/controllers/hostswitch_other.go` — Reviewed, no issues
- `pkg/controllers/lima/limavm/controllers/hostswitch_windows.go` — C1, I1, I2, I3, I4, S1, S2
- `pkg/controllers/lima/limavm/controllers/hostswitch_windows_test.go` — Reviewed, no issues
- `pkg/controllers/lima/limavm/controllers/limavm_controller.go` — Reviewed, no issues
- `pkg/controllers/lima/limavm/controllers/limavm_lifecycle.go` — Reviewed, no issues

## Resolution

| | |
|---|---|
| **Addressed** | 2026-04-08 |
| **Commit** | `a29eda1 Add host-switch virtual network for WSL2 instances` (amended) |

| # | Finding | Action |
|---|---------|--------|
| 1 | Important I1: Missing `stopHostSwitch` in `handleWatchedState` default case | Fixed |
| 2 | Important I2: Nested handshake goroutines can panic on closed `found` channel | Fixed |
| 3 | Important I3: Missing read deadline on `readSignature` | Fixed |
| 4 | Important I4: Host-switch failure has no recovery path | Commented |
| 5 | Important I5: `signalListenerReady` does not respect context or timeout | Fixed |
| 6 | Suggestion S1: Defensive double-start guard in `startHostSwitch` | Fixed |
| 7 | Suggestion S2: Design doc references non-existent `hostswitch.go` | Fixed (removed file table) |
| 8 | Suggestion S3: Timeout increase from 60s to 150s lacks comment | Commented |
| 9 | Suggestion S4: Exclude non-matching VMs from further handshake attempts | Fixed |
| 10 | Suggestion S5: Synchronous `AcceptStdio` blocks concurrent connections | Skipped |
| 11 | Testing Gap 1-4: Various testing gaps | Skipped |
| 12 | Documentation Gap 1-2: File table and dependency section | Skipped |