# Deep Review: 20260408-145125-pr-280

| | |
|---|---|
| **Date** | 2026-04-08 14:51 |
| **Repo** | [rancher-sandbox/rancher-desktop-daemon](https://github.com/rancher-sandbox/rancher-desktop-daemon) |
| **Round** | 1 |
| **Author** | @jandubois |
| **PR** | [#280](https://github.com/rancher-sandbox/rancher-desktop-daemon/pull/280) — Add host-switch virtual network for WSL2 instances |
| **Branch** | `host-switch` |
| **Commits** | `a53d4b9` Add host-switch virtual network for WSL2 instances |
| **Reviewers** | Claude Opus 4.6, Codex GPT 5.4, Gemini 3.1 Pro |
| **Verdict** | **Merge with fixes** — lifecycle integration gaps (I1-I2) and a concurrency bug (I3) need attention; the core networking logic is sound |
| **Wall-clock time** | `33 min 29 s` |

---

## Executive Summary

This PR ports the host-switch networking module into the LimaVM controller as an in-process goroutine, enabling WSL2 instances to get network connectivity via a gvisor-tap-vsock virtual L2 network over AF_VSOCK. The implementation is well-structured with clean platform separation and thorough design documentation. The key findings involve lifecycle integration: the host-switch goroutine can be leaked on restart and startup failure paths, and the vsock discovery loop spawns unbounded detached goroutines that risk a panic on channel close. All issues have straightforward fixes.

---

## Critical Issues

None.

---

## Important Issues

I1. **Missing `stopHostSwitch` in `handleRestartNeeded` leaks goroutine on restart** — `limavm_lifecycle.go:329-341` [Claude Opus 4.6] (important, gap)

```go
case phaseRunning:
    logger.Info("Restart needed: stopping running instance")
    r.shutdownHostagent(ctx, limaVM.Name, nil)
    r.stopWatcher(limaVM.Name)
```

Every other shutdown path calls `stopHostSwitch` alongside `stopWatcher`: `stopInstance` at line 502, `handleDeletion` at line 40, `handleWatchedState` crash recovery at line 213, and `shutdownAllHostagents` at line 547. The `handleRestartNeeded(phaseRunning)` branch is the only path that stops the watcher without stopping the host-switch. On the next reconcile, `startHostSwitch` overwrites the map entry (line 58) without cancelling the old goroutine, leaking a vsock listener, a virtual network, and the errgroup goroutines.

While `handleRestartNeeded` predates this PR (confirmed via `git blame`), the lifecycle pattern it should follow was introduced by this PR.

Fix:

```diff
 	case phaseRunning:
 		logger.Info("Restart needed: stopping running instance")
 		r.shutdownHostagent(ctx, limaVM.Name, nil)
 		r.stopWatcher(limaVM.Name)
+		r.stopHostSwitch(limaVM.Name)
```

---

I2. **`startHostSwitch` overwrites state without cancelling existing goroutine; leaked on `startInstance` failure** — `hostswitch_windows.go:54-65`, `limavm_lifecycle.go:446-469` [Claude Opus 4.6, Codex GPT 5.4, Gemini 3.1 Pro] (important, regression)

```go
r.hostSwitchMu.Lock()
r.hostSwitchStates[name] = &hostSwitchState{
    cancel: hsCancel,
    done:   done,
}
r.hostSwitchMu.Unlock()
```

`startHostSwitch` unconditionally overwrites `hostSwitchStates[name]` at line 58. If the map already has an entry for this instance (from a previous call), the old goroutine's cancel function is discarded, and the goroutine runs until the manager shuts down.

This manifests through two paths. First, `startInstance` calls `startHostSwitch` at line 446, but neither the `haCmd.Start()` failure at line 454 nor the `waitForPIDFile` failure at line 463 calls `stopHostSwitch`. On retry, `startHostSwitch` orphans the previous goroutine. Second, even without `startInstance` failures, any code path that calls `startHostSwitch` twice for the same instance (e.g., I1's restart path) leaks the first goroutine.

Fix (two parts):

Make `startHostSwitch` idempotent — stop any existing goroutine before starting a new one:

```diff
 func (r *LimaVMReconciler) startHostSwitch(ctx context.Context, name string, inst *limatype.Instance) {
 	if inst.VMType != limatype.WSL2 {
 		return
 	}

+	r.stopHostSwitch(name)
+
 	hsCtx, hsCancel := context.WithCancel(ctx)
```

Add cleanup on `startInstance` error paths after line 446:

```diff
 	if err := haCmd.Start(); err != nil {
 		logger.Error(err, "Failed to start hostagent")
+		r.stopHostSwitch(limaVM.Name)
 		...
 		return ctrl.Result{}, err
 	}

 	if err := r.waitForPIDFile(haPIDPath, haStderrPath); err != nil {
 		logger.Error(err, "Hostagent did not start")
+		r.stopHostSwitch(limaVM.Name)
 		...
 		return ctrl.Result{}, err
 	}
```

---

I3. **Unbounded detached goroutines in `attemptHandshake` with potential panic on channel close** — `hostswitch_windows.go:289-327` [Codex GPT 5.4, Gemini 3.1 Pro] (important, regression)

```go
case <-ticker.C:
    // Dial in a goroutine to avoid blocking on hvsock.Dial timeout.
    go func() {
        conn, err := getVsockConnection(vmGUID, vsockHandshakePort)
        // ...
        if sig == signaturePhrase {
            select {
            case found <- vmGUID:
            default:
            }
        }
    }()
```

`attemptHandshake` spawns a new detached goroutine every second (line 299). Each goroutine blocks in `hvsock.Dial`, which accepts no context and has no built-in timeout. On a host with several unresponsive Hyper-V VMs, goroutines accumulate at one per GUID per second.

Worse, these detached goroutines outlive the WaitGroup. When `getVMGUID` receives a match from `found` and returns, `defer cancel()` fires, all `attemptHandshake` functions return, `wg.Wait()` completes, and `close(found)` executes. A detached goroutine that completed its dial before cancellation but reaches the `found <- vmGUID` send after `close(found)` triggers a panic. The `select { ... default: }` does not protect against sends on a closed channel.

Additionally, `readSignature` (line 331) and `signalListenerReady` (line 340) perform I/O without deadlines. A peer that connects but sends incomplete data blocks the goroutine indefinitely.

Fix: make the handshake sequential and add I/O deadlines. This eliminates the goroutine leak, the panic risk, and the missing-deadline problem in one change:

```go
func attemptHandshake(ctx context.Context, logger logr.Logger, vmGUID hvsock.GUID, found chan<- hvsock.GUID) {
	for {
		if ctx.Err() != nil {
			return
		}
		conn, err := getVsockConnection(vmGUID, vsockHandshakePort)
		if err != nil {
			select {
			case <-ctx.Done():
				return
			case <-time.After(time.Second):
			}
			continue
		}
		conn.SetReadDeadline(time.Now().Add(5 * time.Second))
		sig, err := readSignature(conn)
		conn.Close()
		if err != nil {
			logger.V(1).Info("Handshake read failed", "vmGUID", vmGUID.String(), "error", err)
			continue
		}
		if sig == signaturePhrase {
			logger.V(1).Info("Signature matched", "vmGUID", vmGUID.String())
			found <- vmGUID
			return
		}
		// Wrong VM — stop probing this GUID.
		return
	}
}
```

Also add a write deadline in `signalListenerReady`:

```diff
 func signalListenerReady(vmGUID hvsock.GUID) error {
 	conn, err := getVsockConnection(vmGUID, vsockHandshakePort)
 	if err != nil {
 		return err
 	}
 	defer conn.Close()
+	conn.SetWriteDeadline(time.Now().Add(5 * time.Second))
 	_, err = conn.Write([]byte(readySignal))
 	return err
 }
```

I4. **Vsock listener leaked on early return in `runHostSwitch`** — `hostswitch_windows.go:96-138` [Claude Opus 4.6] (important, regression)

```go
ln, err := vsockHandshake(ctx, logger)
if err != nil {
    logger.Error(err, "Vsock handshake failed")
    return
}

cfg := newVirtualNetworkConfig(*subnet)
vn, err := virtualnetwork.New(&cfg)
if err != nil {
    logger.Error(err, "Failed to create virtual network")
    return  // ln leaked
}
```

After `vsockHandshake` returns a listener at line 96, two error paths return without closing it: `virtualnetwork.New` failure (line 104) and `vn.Listen` failure (line 135). The listener holds an AF_VSOCK socket bound to a VM GUID + service ID. Each leaked listener blocks that service ID until the socket is garbage collected.

Fix: add explicit `ln.Close()` to each early-return path after line 96:

```diff
 	cfg := newVirtualNetworkConfig(*subnet)
 	vn, err := virtualnetwork.New(&cfg)
 	if err != nil {
+		ln.Close()
 		logger.Error(err, "Failed to create virtual network")
 		return
 	}
 ...
 	vnLn, err := vn.Listen("tcp", apiAddr)
 	if err != nil {
+		ln.Close()
 		logger.Error(err, "Failed to listen on API address", "addr", apiAddr)
 		return
 	}
```

---

## Suggestions

S1. **HTTP `Serve` returns a non-`ErrServerClosed` error on clean shutdown** — `hostswitch_windows.go:145-160` [Claude Opus 4.6, Gemini 3.1 Pro] (suggestion, regression)

```go
g.Go(func() error {
    <-ctx.Done()
    return vnLn.Close()
})
g.Go(func() error {
    s := &http.Server{...}
    err := s.Serve(vnLn)
    if err != nil && !errors.Is(err, http.ErrServerClosed) {
        return err
    }
    return nil
})
```

On shutdown, closing `vnLn` causes `s.Serve` to return a network error (not `http.ErrServerClosed`, which only `Server.Shutdown`/`Server.Close` produce). This passes the error filter, causing `g.Wait()` to return an error and log "Host-switch exited with error" on every clean shutdown.

Fix: also filter `net.ErrClosed`:

```diff
 		err := s.Serve(vnLn)
-		if err != nil && !errors.Is(err, http.ErrServerClosed) {
+		if err != nil && !errors.Is(err, http.ErrServerClosed) && !errors.Is(err, net.ErrClosed) {
 			return err
 		}
```

---

S2. **No unit tests for platform-independent helpers** — `hostswitch.go`, `hostswitch_windows.go` [Claude Opus 4.6, Codex GPT 5.4, Gemini 3.1 Pro] (suggestion, regression)

`validateSubnet` is pure logic in `hostswitch.go` (no build tag) and trivially testable on any platform. `newVirtualNetworkConfig` maps the subnet struct into gvisor-tap-vsock configuration and is also portable.

Fix: add `hostswitch_test.go` (no build tag) with table-driven tests for `validateSubnet` covering the default subnet, a custom subnet, an invalid CIDR, and an IPv6 CIDR. Optionally test `newVirtualNetworkConfig` to verify DNS zones, DHCP leases, and NAT mappings match the subnet.

---

## Design Observations

**Concerns**

- **Host-switch failure is invisible to the reconciler** (future) — `startHostSwitch` is fire-and-forget: it spawns a goroutine and returns immediately. If the goroutine fails (handshake timeout, virtual network error), the reconciler has no signal. The guest's `network-setup.service` blocks, the hostagent eventually times out, and the watcher reports `phaseStopped` — but the root cause is buried in logs. A future improvement could have the goroutine report failure back to the reconciler (via a status field or reconcile enqueue) so networking failures surface in the LimaVM conditions. [Codex GPT 5.4, Claude Opus 4.6]

**Strengths**

- Clean platform separation: shared types in `hostswitch.go`, Windows implementation in `hostswitch_windows.go`, no-op stubs in `hostswitch_other.go`. The `!windows` build tag ensures zero overhead on non-Windows platforms. [Claude Opus 4.6]
- The design doc (`networking.md`) is thorough — architecture diagram, protocol specification, lifecycle sequence diagrams, and future work. It gives future reviewers a concrete model instead of leaving the Windows networking path implicit. [Claude Opus 4.6, Codex GPT 5.4]
- The handshake design handles multi-VM discovery correctly: parallel signature checks per GUID with a buffered channel for first-match semantics. [Claude Opus 4.6]
- Promoting `go-winio`, `gvisor-tap-vsock`, and `virtsock` from indirect to direct dependencies correctly reflects their new status as first-party imports. [Claude Opus 4.6]

---

## Testing Assessment

1. **No integration tests for host-switch lifecycle** — The host-switch requires a running WSL2 VM, which limits integration testing. However, the lifecycle integration (start/stop ordering, error path cleanup) could be tested by verifying state map operations and cancel function calls with mock interfaces.
2. **No unit tests for `validateSubnet`** — Pure function, platform-independent, trivially testable (see S2).
3. **No unit tests for `newVirtualNetworkConfig`** — Verifying DNS zones, DHCP leases, and NAT mappings catches configuration regressions.
4. **No test for the host-switch dying after startup succeeds** — The most important recovery path introduced by this PR is unverified.

---

## Documentation Assessment

The `networking.md` design doc is comprehensive and well-structured. The `api_lima.md` cross-reference is appropriate. Once I1 is fixed, the doc should describe what happens when the host-switch exits unexpectedly — the crash recovery section covers the hostagent but not the host-switch.

---

## Agent Performance Retro

### [Claude]

Claude read all 9 changed files plus the hostagent watcher, then ran 7 `git blame` commands to verify which changes were introduced by this PR. It identified the most architecturally significant finding (I1: missing `stopHostSwitch` in `handleRestartNeeded`) by systematically tracing every shutdown path and confirming the gap with `git blame`. It also found the vsock listener leak (I4) that both other agents missed. The HTTP shutdown issue was correctly scoped as a suggestion.

Claude did not flag the `attemptHandshake` goroutine leak or the panic risk. Its Coverage Summary accounts for all files, but the handshake concurrency analysis was shallow — it praised the parallel design as a strength without examining the goroutine lifecycle.

**Tool call highlights**: 23 calls. 10 Read, 8 Bash (7 `git blame`, 1 `git diff`), 5 Grep. Heavy `git blame` usage explains the accurate regression vs. gap classifications.

### [Codex]

Codex made 52 tool calls — the highest count — reading all changed files, test files, the hostagent watcher, and even upstream dependency source code (`gvisor-tap-vsock`, `virtsock`). It ran `go test` to check for existing test coverage and verified the WSL2 image integration path. The finding about host-switch failures being invisible to the reconciler (folded into the design observation) demonstrates good architectural thinking.

Codex identified the goroutine overwrite (I2) and unbounded probing (I3) but did not flag the panic risk on the closed channel or the missing `handleRestartNeeded` path (I1). It also missed the vsock listener leak (I4).

**Tool call highlights**: 52 calls. 51 `exec_command`, 1 `write_stdin`. Shell commands: 21 `sed -n` (file reads), 11 `nl` (numbered listings), 9 `rg` (search), 5 `git blame`. Codex used `go doc` to inspect upstream APIs and `go test` to verify test state — more thorough upstream analysis than the other agents.

### [Gemini]

Gemini made 13 tool calls — the fewest. It focused on the hostswitch_windows.go file (3 reads) and attempted a live build (`make build`). It identified the closed-channel panic (part of I3) that Codex missed, provided a complete sequential-handshake fix, and flagged the missing I/O deadlines. The HTTP server shutdown finding was initially rated important but is correctly a suggestion.

Gemini raised a concern about context lifetime decoupling — that the reconcile context might be per-reconcile in controller-runtime v0.15+. This is invalid for v0.23.3, where the reconcile context is the manager's context (the existing `startWatcher` uses the same pattern). Gemini did not run `git blame`, so it could not distinguish regressions from pre-existing issues.

**Tool call highlights**: 13 calls. 7 `grep_search`, 4 `read_file`, 2 `run_shell_command` (1 attempted live test, 1 `make build`). No `git blame` usage — consistent with the known limitation.

### Summary

| | Claude Opus 4.6 | Codex GPT 5.4 | Gemini 3.1 Pro |
|---|---|---|---|
| Duration | 340s | 228s | 257s |
| Findings | 0C 2I 2S | 0C 2I 1S | 0C 2I 1S |
| Tool calls | 23 (Read 10, Bash 8, Grep 5) | 52 (exec_command 51) | 13 (grep_search 7, read_file 4) |
| Design observations | 1 concern, 4 strengths | 1 concern, 1 strength | 1 concern (invalid) |
| False positives | 0 | 0 | 1 (context lifetime) |
| Unique insights | I1 (handleRestartNeeded), I4 (listener leak) | upstream API inspection | Panic on closed channel, I/O deadlines |
| Files reviewed | 9/9 | 9/9 | 7/9 |
| Coverage misses | 0 | 0 | 2 (api_lima.md, lima-images-wsl2.yaml marked trivial) |

**Reconciliation**: Gemini C1 (critical, goroutine leak + panic): downgraded to important I3 — the panic requires a specific timing window (detached goroutine must complete dial before context cancellation but send after channel close) that is unlikely under normal operation. Gemini C2 (critical, missing I/O deadlines): merged into I3 fix recommendation as part of the sequential handshake — the deadlines are needed for the fix to work correctly, not a standalone critical issue. Gemini I2 (important, HTTP shutdown): downgraded to suggestion S1 — spurious log does not affect functionality. Codex I1 (important, host-switch failure invisible): reclassified as design observation — the system self-heals through the watcher; visibility is a future improvement.

---

## Review Process Notes

**Repo context updates**

- [repo] Add: "The host-switch goroutine (`hostswitch_windows.go`) uses fire-and-forget lifecycle — failures are logged but not surfaced to the reconciler. When reviewing changes to the host-switch, verify that every shutdown path in `limavm_lifecycle.go` calls `stopHostSwitch` alongside `stopWatcher`." Why: Claude's I1 finding (missing `stopHostSwitch` in `handleRestartNeeded`) was unique — the other two agents missed this gap because they did not systematically enumerate all shutdown paths. Future reviews of lifecycle changes need this context.

---

## Appendix: Original Reviews

### [Claude] — Pass 1

Acknowledged — that background task also completed, but its results were already used in the review (confirmed `DHCPStaticLeases` is `IP → MAC`, matching the PR's implementation).
pensuse distro to get network connectivity via a gvisor-tap-vsock virtual L2 network over AF_VSOCK. The implementation is well-structured with clean platform separation (`hostswitch_windows.go` / `hostswitch_other.go`) and thorough design documentation. The key findings are: a missing `stopHostSwitch` call in `handleRestartNeeded` that leaks the goroutine on restart, a vsock listener leak in `runHostSwitch` error paths, and missing cleanup when `startInstance` fails after the host-switch is already launched.

**Verdict:** Approve with fixes for I1–I3.

---

## Findings

### Critical Issues

None.

### Important Issues

I1. **Missing `stopHostSwitch` in `handleRestartNeeded` leaks goroutine on restart** — `limavm_lifecycle.go:329-341` (important, gap)

```go
case phaseRunning:
    logger.Info("Restart needed: stopping running instance")
    r.shutdownHostagent(ctx, limaVM.Name, nil)
    r.stopWatcher(limaVM.Name)

    // Clear restartNeeded and set Stopped condition in one write.
    patch := client.MergeFrom(limaVM.DeepCopy())
```

Every other shutdown path calls `stopHostSwitch` alongside `stopWatcher`: `stopInstance` at line 502, `handleDeletion` at line 40, `handleWatchedState` crash recovery at line 213, `shutdownAllHostagents` at line 547. `handleRestartNeeded(phaseRunning)` is the only path that stops the watcher but not the host-switch. On the subsequent reconcile, `startInstance` at line 446 calls `startHostSwitch`, which overwrites the map entry at `hostswitch_windows.go:57-62` without cancelling the old goroutine. The old goroutine runs until the manager shuts down — leaking a vsock listener, a virtual network, and several goroutines from the errgroup.

While `handleRestartNeeded` predates this PR (commit `776b8f77`), the pattern it should follow was introduced by this PR — it is a gap in the lifecycle integration.

Fix:

```diff
 	case phaseRunning:
 		logger.Info("Restart needed: stopping running instance")
 		r.shutdownHostagent(ctx, limaVM.Name, nil)
 		r.stopWatcher(limaVM.Name)
+		r.stopHostSwitch(limaVM.Name)
```

---

I2. **Vsock listener leaked on early return in `runHostSwitch`** — `hostswitch_windows.go:96-138` (important, regression)

```go
ln, err := vsockHandshake(ctx, logger)
if err != nil {
    logger.Error(err, "Vsock handshake failed")
    return
}

cfg := newVirtualNetworkConfig(*subnet)
vn, err := virtualnetwork.New(&cfg)
if err != nil {
    logger.Error(err, "Failed to create virtual network")
    return  // ln leaked
}
```

After `vsockHandshake` succeeds and returns `ln` at line 96, two error paths return without closing it: `virtualnetwork.New` failure at line 104 and `vn.Listen` failure at line 135. The vsock listener holds an AF_VSOCK socket bound to a specific VM GUID + service ID. The leak is bounded (one per restart attempt) and self-heals on process exit, but it blocks the same service ID for that VM GUID until the socket is GC'd.

Fix: close `ln` on every early-return path after line 96. A `defer` would work cleanly here since `ln` is only needed by the errgroup goroutines, which keep it alive via closure:

```diff
 	ln, err := vsockHandshake(ctx, logger)
 	if err != nil {
 		logger.Error(err, "Vsock handshake failed")
 		return
 	}
+	// The errgroup's listener-closer (ln.Close on ctx.Done) takes over
+	// once the errgroup starts. This defer handles early-return paths
+	// between here and errgroup startup.
+	lnClosed := false
+	defer func() {
+		if !lnClosed {
+			ln.Close()
+		}
+	}()
```

Then set `lnClosed = true` after the errgroup goroutines that close `ln` are registered (after line 131). Alternatively, the simpler approach: add explicit `ln.Close()` calls to the two error paths.

---

I3. **Host-switch not stopped when `startInstance` fails after launching it** — `limavm_lifecycle.go:446-469` (important, regression)

```go
r.startHostSwitch(ctx, limaVM.Name, inst)       // line 446

// Start hostagent in background.
haCmd := exec.CommandContext(ctx, rddPath, args...)
...
if err := haCmd.Start(); err != nil {            // line 454
    ...
    return ctrl.Result{}, err                    // host-switch still running
}

if err := r.waitForPIDFile(haPIDPath, haStderrPath); err != nil {  // line 463
    ...
    return ctrl.Result{}, err                    // host-switch still running
}
```

`startHostSwitch` is called at line 446, but the error paths at lines 454–459 (`haCmd.Start` failure) and 463–468 (`waitForPIDFile` failure) return without stopping the host-switch. On the next retry, `startHostSwitch` overwrites the map entry and orphans the old goroutine — the same mechanism as I1.

Fix: stop the host-switch on each error path in `startInstance` after line 446:

```diff
 	if err := haCmd.Start(); err != nil {
 		logger.Error(err, "Failed to start hostagent")
+		r.stopHostSwitch(limaVM.Name)
 		if updateErr := r.updateCondition(ctx, limaVM, ConditionRunning, metav1.ConditionFalse, ReasonStartFailed, err.Error()); updateErr != nil {
 			logger.Error(updateErr, "Failed to update status after hostagent start failure")
 		}
 		return ctrl.Result{}, err
 	}
```

And the same for the `waitForPIDFile` error path at line 463–468.

---

### Suggestions

S1. **HTTP `Serve` returns a non-`ErrServerClosed` error on shutdown** — `hostswitch_windows.go:145-160` (suggestion, regression)

```go
g.Go(func() error {
    <-ctx.Done()
    return vnLn.Close()
})
g.Go(func() error {
    s := &http.Server{...}
    err := s.Serve(vnLn)
    if err != nil && !errors.Is(err, http.ErrServerClosed) {
        return err
    }
    return nil
})
```

When the context is cancelled, the goroutine at line 145 closes `vnLn`. `s.Serve` at line 155 then returns a closed-listener error, which is NOT `http.ErrServerClosed` (that constant is only returned by `Server.Shutdown`/`Server.Close`). This causes `g.Wait()` to return an error, and the shutdown is logged as "Host-switch exited with error" at line 165 — noisy but not broken.

Fix: also check for `net.ErrClosed`:

```diff
 		err := s.Serve(vnLn)
-		if err != nil && !errors.Is(err, http.ErrServerClosed) {
+		if err != nil && !errors.Is(err, http.ErrServerClosed) && !errors.Is(err, net.ErrClosed) {
 			return err
 		}
```

---

S2. **No unit tests for `validateSubnet` or the virtual network config builder** — `hostswitch.go`, `hostswitch_windows.go` (suggestion, regression)

`validateSubnet` is pure logic that derives gateway, DHCP lease, and DNS host addresses from a CIDR string. It's platform-independent (in `hostswitch.go`) and easily testable. `newVirtualNetworkConfig` maps the subnet struct into the gvisor-tap-vsock configuration and is also testable on any platform.

Fix: add a `hostswitch_test.go` (no build tag) with table-driven tests for `validateSubnet` covering at least: the default subnet, a custom subnet, an invalid CIDR, and an IPv6 CIDR.

---

## Design Observations

**Concerns**

- **Host-switch failure is invisible to the reconciler** (future) — `startHostSwitch` is fire-and-forget: it spawns a goroutine and returns immediately. If the goroutine fails (handshake timeout, virtual network creation error), the reconciler has no signal. The guest's `network-setup.service` blocks on the handshake, the hostagent eventually times out, and the watcher reports `phaseStopped` — but the root cause (networking failure) is buried in logs. A future improvement could have the goroutine report its status back (e.g., via a channel or condition) so the reconciler can surface networking failures in the LimaVM status.

**Strengths**

- Clean platform separation: shared types and validation in `hostswitch.go`, Windows implementation in `hostswitch_windows.go`, no-op stubs in `hostswitch_other.go`. The `!windows` build tag ensures zero cost on non-Windows platforms.
- The design doc (`networking.md`) is excellent — architecture diagram, protocol specification, lifecycle integration sequence diagrams, and future work. It makes the intent clear for future reviewers.
- The handshake design correctly handles the multi-VM discovery problem (multiple Hyper-V VMs) with parallel signature checks and a buffered channel for first-match semantics.
- Promoting `go-winio`, `gvisor-tap-vsock`, and `virtsock` from indirect to direct dependencies correctly reflects that they are now first-party imports, not just transitive dependencies.

---

## Testing Assessment

1. **No integration tests for host-switch lifecycle** — The host-switch cannot be tested without a running WSL2 VM, which is reasonable. However, the lifecycle integration (start/stop ordering, error path cleanup) could be tested with mock interfaces or by verifying state map operations.
2. **No unit tests for `validateSubnet`** — Pure function, platform-independent, trivially testable (see S2).
3. **No unit tests for `newVirtualNetworkConfig`** — Verifying the DNS zone, DHCP lease, and NAT mappings are correctly assembled from the subnet would catch configuration regressions.

---

## Documentation Assessment

The `networking.md` design doc is comprehensive and well-structured. The `api_lima.md` cross-reference addition is appropriate. No gaps identified.

---

## Acknowledged Limitations

None explicitly acknowledged in TODO comments within the changed code.

---

## Coverage Summary

| File | Status |
|------|--------|
| `docs/design/api_lima.md` | Reviewed, no issues |
| `docs/design/networking.md` | Reviewed, no issues |
| `go.mod` | Reviewed, no issues — indirect→direct promotion is correct |
| `pkg/controllers/app/app/lima-images-wsl2.yaml` | Reviewed, no issues — image URL and digest updated |
| `pkg/controllers/lima/limavm/controllers/hostswitch.go` | S2 |
| `pkg/controllers/lima/limavm/controllers/hostswitch_other.go` | Reviewed, no issues |
| `pkg/controllers/lima/limavm/controllers/hostswitch_windows.go` | I2, S1, S2 |
| `pkg/controllers/lima/limavm/controllers/limavm_controller.go` | Reviewed, no issues |
| `pkg/controllers/lima/limavm/controllers/limavm_lifecycle.go` | I1, I3 |

### [Codex] — Pass 1

### Executive Summary

This PR adds an in-process WSL2 host-switch to the LimaVM controller, wires it into the Lima lifecycle, and switches the app’s WSL2 image to the opensuse distro that depends on this networking path. The main issues are both in the new host-switch lifecycle: the goroutine is not stopped on restart/start-failure paths, and the handshake fan-out closes its result channel while untracked probe goroutines can still send to it.

### Findings

**Critical Issues**

None.

**Important Issues**

I1. **Host-switch state leaks across restart and startup-failure paths** — `pkg/controllers/lima/limavm/controllers/limavm_lifecycle.go:330-341`, `pkg/controllers/lima/limavm/controllers/limavm_lifecycle.go:443-468`, `pkg/controllers/lima/limavm/controllers/hostswitch_windows.go:57-65` (important, regression)

```go
case phaseRunning:
    logger.Info("Restart needed: stopping running instance")
    r.shutdownHostagent(ctx, limaVM.Name, nil)
    r.stopWatcher(limaVM.Name)

    patch := client.MergeFrom(limaVM.DeepCopy())
    limaVM.Status.RestartNeeded = false
```

`startInstance()` now creates a host-switch at line 446 before the hostagent starts, but the error returns at lines 454-459 and 463-468 leave that goroutine running. The restart path at lines 330-341 also shuts down only the hostagent and watcher, then immediately returns without calling `stopHostSwitch()`. Because `startHostSwitch()` overwrites `hostSwitchStates[name]` at line 58, the next start loses the only cancel handle for the old goroutine, so repeated restart/template-update cycles can accumulate orphan host-switches and eventually collide on the next `vsockListen()`/virtual-network startup. `docs/design/networking.md:98-164` says the host-switch stops after the hostagent and watcher, but the restart implementation does not enforce that.

Fix: make host-switch teardown part of every path that can leave `startInstance()` after line 446, and stop it in `handleRestartNeeded()` before returning. A small helper that starts both hostagent and host-switch and unwinds both on failure would make the lifecycle harder to miss.

```diff
 case phaseRunning:
     logger.Info("Restart needed: stopping running instance")
     r.shutdownHostagent(ctx, limaVM.Name, nil)
     r.stopWatcher(limaVM.Name)
+    r.stopHostSwitch(limaVM.Name)

     patch := client.MergeFrom(limaVM.DeepCopy())
     limaVM.Status.RestartNeeded = false
     r.setCondition(limaVM, ConditionRunning, metav1.ConditionFalse, ReasonStopped, "Stopped for restart")
     return ctrl.Result{}, r.Status().Patch(ctx, limaVM, patch)
```

```diff
 r.startHostSwitch(ctx, limaVM.Name, inst)

 if err := haCmd.Start(); err != nil {
+    r.stopHostSwitch(limaVM.Name)
     logger.Error(err, "Failed to start hostagent")
     ...
 }

 if err := r.waitForPIDFile(haPIDPath, haStderrPath); err != nil {
+    r.stopHostSwitch(limaVM.Name)
     logger.Error(err, "Hostagent did not start")
     ...
 }
```

I2. **Handshake probes can panic by sending to `found` after it has been closed** — `pkg/controllers/lima/limavm/controllers/hostswitch_windows.go:269-273`, `pkg/controllers/lima/limavm/controllers/hostswitch_windows.go:298-325` (important, regression)

```go
go func() {
    wg.Wait()
    close(found)
}()

go func() {
    ...
    if sig == signaturePhrase {
        select {
        case found <- vmGUID:
        default:
        }
    }
}()
```

`getVMGUID()` closes `found` at lines 271-273 after the outer `attemptHandshake()` goroutines exit, but each outer goroutine spawns untracked inner probe goroutines on every tick at lines 298-325. The comment at line 298 explicitly says `hvsock.Dial` can block, so overlap is expected. Once `getVMGUID()` returns its first match or times out, `cancel()` makes the outers exit, `wg.Wait()` closes `found`, and any late inner probe that finishes `readSignature()` and hits the send at line 321 will panic the controller on `send on closed channel`.

Fix: do not spawn fire-and-forget probes. Either probe synchronously once per VM per loop, or track the inner goroutines in the same wait group and close `found` only after they have all exited. The simpler fix is to remove the nested `go func()` and make each `attemptHandshake()` own a single in-flight dial at a time.

```diff
-        case <-ticker.C:
-            go func() {
-                conn, err := getVsockConnection(vmGUID, vsockHandshakePort)
-                ...
-                if sig == signaturePhrase {
-                    select {
-                    case found <- vmGUID:
-                    default:
-                    }
-                }
-            }()
+        case <-ticker.C:
+            conn, err := getVsockConnection(vmGUID, vsockHandshakePort)
+            if ctx.Err() != nil {
+                if conn != nil {
+                    conn.Close()
+                }
+                return
+            }
+            ...
+            if sig == signaturePhrase {
+                select {
+                case found <- vmGUID:
+                case <-ctx.Done():
+                }
+                return
+            }
```

**Suggestions**

None.

### Design Observations

**Concerns**

1. `(in-scope)` The host-switch now has its own `hostSwitchStates` map in [pkg/controllers/lima/limavm/controllers/limavm_controller.go](/private/tmp/review-codex-CJcX1Y/pkg/controllers/lima/limavm/controllers/limavm_controller.go#L140), separate from the existing `instanceState` lifecycle. That split is what let restart/failed-start paths stop the hostagent and watcher but forget the host-switch. Folding the host-switch handle into `instanceState`, or centralizing all start/stop transitions behind one helper, would remove an entire class of teardown mismatches.

**Strengths**

1. `(in-scope)` The build-tag split between [hostswitch_windows.go](/private/tmp/review-codex-CJcX1Y/pkg/controllers/lima/limavm/controllers/hostswitch_windows.go), [hostswitch_other.go](/private/tmp/review-codex-CJcX1Y/pkg/controllers/lima/limavm/controllers/hostswitch_other.go), and the shared subnet/config code in [hostswitch.go](/private/tmp/review-codex-CJcX1Y/pkg/controllers/lima/limavm/controllers/hostswitch.go) keeps the Windows-only complexity contained.

2. `(in-scope)` [docs/design/networking.md](/private/tmp/review-codex-CJcX1Y/docs/design/networking.md) is unusually detailed for a controller-side protocol change; the sequence diagrams make the intended lifecycle easy to review against the code.

### Testing Assessment

Existing tests pass with `go test ./pkg/controllers/lima/limavm/...`, but they do not cover the new Windows-specific host-switch paths.

1. No test exercises the WSL2 restart path (`status.restartNeeded`, template-change restart, or `rdd limavm restart`) with the opensuse image and asserts that the old host-switch is stopped before the new one starts.
2. No test covers hostagent startup failure after `startHostSwitch()` but before the watcher starts, so the leaked-host-switch path at [limavm_lifecycle.go:443-468](/private/tmp/review-codex-CJcX1Y/pkg/controllers/lima/limavm/controllers/limavm_lifecycle.go#L443) is currently untested.
3. No test covers handshake cancellation/timeout, so the `found`-channel race in [hostswitch_windows.go:269-325](/private/tmp/review-codex-CJcX1Y/pkg/controllers/lima/limavm/controllers/hostswitch_windows.go#L269) would only show up as a runtime panic on Windows.

### Documentation Assessment

[docs/design/networking.md:98-164](/private/tmp/review-codex-CJcX1Y/docs/design/networking.md#L98) says the host-switch stops after the hostagent and watcher and that crash recovery starts “fresh,” but the current restart implementation in I1 does not do that yet. The doc should be kept as-is if I1 is fixed before merge; otherwise it overstates the implemented lifecycle.

### Coverage Summary

- `docs/design/api_lima.md` — Reviewed, no issues
- `docs/design/networking.md` — I1
- `go.mod` — Reviewed, no issues
- `pkg/controllers/app/app/lima-images-wsl2.yaml` — Reviewed, no issues
- `pkg/controllers/lima/limavm/controllers/hostswitch.go` — Reviewed, no issues
- `pkg/controllers/lima/limavm/controllers/hostswitch_other.go` — Reviewed, no issues
- `pkg/controllers/lima/limavm/controllers/hostswitch_windows.go` — I1, I2
- `pkg/controllers/lima/limavm/controllers/limavm_controller.go` — Reviewed, no issues
- `pkg/controllers/lima/limavm/controllers/limavm_lifecycle.go` — I1
o` — Reviewed, no issues
- `pkg/controllers/lima/limavm/controllers/limavm_lifecycle.go` — I1, I2, S1

### [Gemini] — Pass 1

An excellent architectural addition to handle WSL2's networking limitations without relying on the host's brittle network bridges. Integrating `gvisor-tap-vsock` natively over AF_VSOCK is a robust design. However, the lifecycle management of the host-switch goroutine contains several critical flaws that will cause it to instantly fail on cold boots, leak goroutines, and unexpectedly exit.

### Executive Summary

This PR successfully ports the host-switch virtual network architecture for WSL2 instances and correctly integrates `gvisor-tap-vsock` with Lima's networking model. However, the implementation introduces critical bugs in the goroutine lifecycle: the Reconcile context cancellation instantly kills the host-switch, the registry discovery only runs once before the VM exists, and the vsock polling loops leak blocked goroutines and network connections.

### Critical Issues

C1. **Reconcile context cancellation immediately kills host-switch goroutine** (critical, regression) — `pkg/controllers/lima/limavm/controllers/hostswitch_windows.go:49`

```go
	hsCtx, hsCancel := context.WithCancel(ctx)
	done := make(chan struct{})
```

The `ctx` passed into `startHostSwitch` originates from the `Reconcile` loop. In `controller-runtime`, the Reconcile context is cancelled the exact moment the `Reconcile` method returns — which happens immediately after the asynchronous `startInstance` returns. Consequently, `hsCtx` is cancelled instantly, causing the `runHostSwitch` goroutine to shut down before the VM even boots.

Fix: The host-switch must outlive the Reconcile loop. Its shutdown is already explicitly managed via `stopHostSwitch`. Per the PR instructions, this is a valid exception for `context.Background()`. Update it and include an explanatory comment:

```diff
-	hsCtx, hsCancel := context.WithCancel(ctx)
+	// Must use context.Background() because the Reconcile context is cancelled
+	// when the Reconcile function returns, but this goroutine must run for the
+	// lifetime of the VM instance. Shutdown is propagated via hsCancel.
+	hsCtx, hsCancel := context.WithCancel(context.Background())
```

C2. **Host-switch fails on cold boot due to single registry check** (critical, regression) — `pkg/controllers/lima/limavm/controllers/hostswitch_windows.go:193`

```go
	names, err := key.ReadSubKeyNames(0)
	key.Close()
	if len(names) == 0 {
		return hvsock.GUIDZero, errors.New("no running Hyper-V VMs found")
	}
```

Because `startHostSwitch` is intentionally called *before* `haCmd.Start()`, the WSL2 utility VM has not booted yet and is not present in the `VolatileStore\ComputeSystem` registry key. `getVMGUID` reads this key exactly once. On a cold boot, it instantly fails with "no running Hyper-V VMs found" and the host-switch exits. If other unrelated VMs are running, it polls those specific VMs but never re-checks the registry, permanently missing our VM when it eventually boots.

Fix: `getVMGUID` must periodically poll the registry to discover new VMs until the VM connects or the handshake times out.

C3. **Unbounded goroutine leak in `attemptHandshake`** (critical, regression) — `pkg/controllers/lima/limavm/controllers/hostswitch_windows.go:214`

```go
		case <-ticker.C:
			// Dial in a goroutine to avoid blocking on hvsock.Dial timeout.
			go func() {
				conn, err := getVsockConnection(vmGUID, vsockHandshakePort)
// ...
				sig, err := readSignature(conn)
```

`attemptHandshake` blindly spawns a detached goroutine every second. `getVsockConnection` (which wraps `hvsock.Dial`) and `readSignature` (which uses `io.ReadFull`) do not enforce deadlines. If a VM's vsock connection blocks or hangs (e.g. connects but never sends the phrase), these goroutines block indefinitely. Checking `ctx.Err()` *after* the dial completes does not unblock them. When `getVMGUID` successfully finds the VM, the context cancels and `attemptHandshake` returns, permanently leaking all these stuck goroutines.

Fix: Limit to a single active dial attempt per VM, and set a read deadline before `io.ReadFull`.

```go
func attemptHandshake(ctx context.Context, logger logr.Logger, vmGUID hvsock.GUID, found chan<- hvsock.GUID) {
	ticker := time.NewTicker(time.Second)
	defer ticker.Stop()

	var dialing int32

	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			if !atomic.CompareAndSwapInt32(&dialing, 0, 1) {
				continue // Previous dial still in progress
			}
			go func() {
				defer atomic.StoreInt32(&dialing, 0)
				conn, err := getVsockConnection(vmGUID, vsockHandshakePort)
				if err != nil {
					return
				}
				defer conn.Close()

				conn.SetReadDeadline(time.Now().Add(5 * time.Second))
				sig, err := readSignature(conn)
				if err == nil && sig == signaturePhrase {
					select {
					case found <- vmGUID:
					default:
					}
				}
			}()
		}
	}
}
```

### Important Issues

I1. **`validateSubnet` silently ignores CIDR mask length** (important, gap) — `pkg/controllers/lima/limavm/controllers/hostswitch.go:42`

```go
	ip, _, err := net.ParseCIDR(subnet)
	if err != nil {
		return nil, fmt.Errorf("invalid subnet %q: %w", subnet, err)
	}
	ipv4 := ip.To4()
```

`validateSubnet` extracts the `ip` address and blindly modifies the fourth octet to calculate the gateway and DNS endpoints, completely ignoring the parsed `network` mask. It implicitly assumes a `/24` prefix. If the subnet string were ever updated to a `/25` or `/16`, this calculation would silently generate invalid IP addresses outside the correct network boundaries.

Fix: Extract `network.IP` from the `*net.IPNet` and validate the mask size:

```diff
-	ip, _, err := net.ParseCIDR(subnet)
+	_, network, err := net.ParseCIDR(subnet)
	if err != nil {
		return nil, fmt.Errorf("invalid subnet %q: %w", subnet, err)
	}
+	if ones, bits := network.Mask.Size(); ones != 24 || bits != 32 {
+		return nil, fmt.Errorf("subnet %q must be a /24 IPv4 CIDR", subnet)
+	}
-	ipv4 := ip.To4()
+	ipv4 := network.IP.To4()
```

I2. **Closing listener directly causes `Serve` to return unhandled error** (important, regression) — `pkg/controllers/lima/limavm/controllers/hostswitch_windows.go:138`

```go
	// Start the port-forwarding API server inside the virtual network.
	apiAddr := fmt.Sprintf("%s:80", cfg.GatewayIP)
	vnLn, err := vn.Listen("tcp", apiAddr)
// ...
	g.Go(func() error {
		<-ctx.Done()
		return vnLn.Close()
	})
```

The shutdown routine directly closes the network listener `vnLn`. This forces the blocked `s.Serve(vnLn)` call to return a `net.ErrClosed` network error rather than `http.ErrServerClosed`. The `errgroup` captures this network error, causing `g.Wait()` to incorrectly report "Host-switch exited with error" on every graceful shutdown.

Fix: Call `s.Close()` to properly shut down the HTTP server:

```diff
+	s := &http.Server{
+		Handler:      mux,
+		ReadTimeout:  10 * time.Second,
+		WriteTimeout: 10 * time.Second,
+	}
+
	g.Go(func() error {
		<-ctx.Done()
-		return vnLn.Close()
+		return s.Close()
	})
	g.Go(func() error {
-		s := &http.Server{
-			Handler:      mux,
-			ReadTimeout:  10 * time.Second,
-			WriteTimeout: 10 * time.Second,
-		}
		err := s.Serve(vnLn)
```

I3. **Missing write deadline in `signalListenerReady`** (important, regression) — `pkg/controllers/lima/limavm/controllers/hostswitch_windows.go:257`

```go
func signalListenerReady(vmGUID hvsock.GUID) error {
	conn, err := getVsockConnection(vmGUID, vsockHandshakePort)
	if err != nil {
		return err
	}
	defer conn.Close()
	_, err = conn.Write([]byte(readySignal))
	return err
}
```

The `conn.Write` call operates without a deadline. A malfunctioning guest process could accept the vsock connection but stall reading the socket buffer, blocking the write indefinitely and holding the connection open.

Fix: Add a write deadline before writing.

```diff
	defer conn.Close()
+	if err := conn.SetWriteDeadline(time.Now().Add(5 * time.Second)); err != nil {
+		return err
+	}
	_, err = conn.Write([]byte(readySignal))
```

### Suggestions

S1. **Missing state cleanup before overwrite in `startHostSwitch`** (suggestion, regression) — `pkg/controllers/lima/limavm/controllers/hostswitch_windows.go:54`

```go
	r.hostSwitchMu.Lock()
	r.hostSwitchStates[name] = &hostSwitchState{
		cancel: hsCancel,
		done:   done,
	}
	r.hostSwitchMu.Unlock()
```

If `startHostSwitch` is invoked twice sequentially without an intervening `stopHostSwitch`, the `hostSwitchStates` map silently overwrites the previous entry, dropping the `cancel` function for the existing goroutine. While `handleWatchedState` usually stops old instances cleanly, defensively stopping the prior instance inside `startHostSwitch` ensures robust state synchronization.

Fix: Unconditionally stop any existing instance before replacing it:

```diff
	if inst.VMType != limatype.WSL2 {
		return
	}

+	r.stopHostSwitch(name)
+
	hsCtx, hsCancel := context.WithCancel(ctx)
```

### Design Observations

**Concerns**
- **AF_VSOCK reliability via brute-force polling (in-scope)**: `getVMGUID` concurrently polls all Hyper-V VMs found in the Windows registry via continuous `hvsock.Dial` attempts. Spamming vsock dials to all VMs (some of which may be suspended or unresponsive) is noisy and prone to spurious timeouts. Since the Lima controller orchestrates the VM lifecycle, an alternative approach would be to extract the active VM's GUID directly from the WSL runtime configuration or Lima inspection output, eliminating the need for a brute-force discovery handshake.

### Testing Assessment

1. **Missing unit tests**: There are no unit tests for `validateSubnet` or `newVirtualNetworkConfig` to ensure `gvisor-tap-vsock` is correctly initialized or that invalid subnet strings are caught cleanly.
2. **Integration coverage**: BATS integration tests (`bats/tests/33-lima-controllers/`) need to be updated to verify that WSL2 instances successfully acquire an IP address over the vsock link.

### Coverage Summary

- `docs/design/api_lima.md` — Trivial
- `docs/design/networking.md` — Reviewed, no issues
- `go.mod` — Trivial
- `pkg/controllers/app/app/lima-images-wsl2.yaml` — Trivial
- `pkg/controllers/lima/limavm/controllers/hostswitch.go` — I1
- `pkg/controllers/lima/limavm/controllers/hostswitch_other.go` — Reviewed, no issues
- `pkg/controllers/lima/limavm/controllers/hostswitch_windows.go` — C1, C2, C3, I2, I3, S1
- `pkg/controllers/lima/limavm/controllers/limavm_controller.go` — Reviewed, no issues
- `pkg/controllers/lima/limavm/controllers/limavm_lifecycle.go` — C1 (caller context)


### Design Observations

**Strengths**
- The integration of the host-switch goroutine into the LimaVM reconciler lifecycle is cleanly executed (`stopHostSwitch` executes precisely when `stopWatcher` does). 
- The newly added `docs/design/networking.md` is exceptionally clear and precisely illustrates the intent behind the vsock protocol layer.

### Testing Assessment

The PR contains no unit or integration tests for the new virtual network components.
1. **Host-Switch Subnet Validation:** `validateSubnet` and `newVirtualNetworkConfig` contain zero external dependencies and should be unit tested directly to ensure IPv4 derivation behaves correctly.
2. **Lifecycle Mocks:** The state machine for vsock polling could be unit tested with a mocked `registry` layer to prevent regressions around VM GUID discovery behavior (like C1).

### Coverage Summary

- `docs/design/api_lima.md`: Trivial
- `docs/design/networking.md`: Reviewed, no issues
- `go.mod`: Trivial
- `pkg/controllers/app/app/lima-images-wsl2.yaml`: Reviewed, no issues
- `pkg/controllers/lima/limavm/controllers/hostswitch.go`: S1
- `pkg/controllers/lima/limavm/controllers/hostswitch_other.go`: Trivial
- `pkg/controllers/lima/limavm/controllers/hostswitch_windows.go`: C1, C2, I1, I2, I3
- `pkg/controllers/lima/limavm/controllers/limavm_controller.go`: Reviewed, no issues
- `pkg/controllers/lima/limavm/controllers/limavm_lifecycle.go`: Reviewed, no issues
