# Deep Review: 20260408-175512-pr-280

| | |
|---|---|
| **Date** | 2026-04-08 17:55 |
| **Repo** | [rancher-sandbox/rancher-desktop-daemon](https://github.com/rancher-sandbox/rancher-desktop-daemon) |
| **Round** | 3 |
| **Author** | [@jandubois](https://github.com/jandubois) |
| **PR** | [#280](https://github.com/rancher-sandbox/rancher-desktop-daemon/pull/280) — Add host-switch virtual network for WSL2 instances |
| **Branch** | `host-switch` |
| **Commits** | `a29eda1` Add host-switch virtual network for WSL2 instances<br>`d498946` Fix go-mod-k8s-sync checkout for cross-repository PRs |
| **Reviewers** | Claude Opus 4.6, Codex GPT 5.4, Gemini 3.1 Pro |
| **Verdict** | **Merge with fixes** — two important issues: accept loop swallows unexpected errors, and one-shot registry read fails when no WSL2 VM exists yet |
| **Wall-clock time** | `32 min 27 s` |

---

## Executive Summary

This PR ports the host-switch networking logic from rancher-desktop into the LimaVM controller as an in-process goroutine. The goroutine performs a vsock handshake with the WSL2 guest, then runs a gvisor-tap-vsock virtual network providing DNS, DHCP, and NAT. A second commit fixes the CI workflow for cross-repository PRs.

The lifecycle integration is thorough — every start, stop, restart, crash, and shutdown path correctly pairs `startHostSwitch` and `stopHostSwitch`. Two issues need attention: the vsock accept loop silently swallows unexpected errors, and the registry-based VM discovery reads only once, failing immediately on fresh systems where no WSL2 VM exists yet.

---

## Critical Issues

None.

---

## Important Issues

I1. **Accept loop swallows unexpected errors** — `hostswitch_windows.go:198-203` [Claude Opus 4.6, Gemini 3.1 Pro] (important, regression)

```go
g.Go(func() error {
	for {
		conn, err := ln.Accept()
		if err != nil {
			// Listener closed (context cancelled).
			return nil //nolint:nilerr // Expected on shutdown.
		}
```

The accept goroutine returns `nil` for all `Accept` failures, not just the `net.ErrClosed` expected during shutdown. If `Accept` fails for an unexpected reason (resource exhaustion, OS-level socket error), the goroutine exits silently. Because it returns `nil`, the errgroup does not cancel the shared context. The HTTP server and context-wait goroutines keep running, but no new vsock connections can be accepted. The guest retains its existing connection but cannot reconnect after a network restart. The only external symptom is a "Host-switch stopped" log at line 241 — no error is surfaced.

Fix: Distinguish shutdown from unexpected errors.

```diff
 		conn, err := ln.Accept()
 		if err != nil {
-			// Listener closed (context cancelled).
-			return nil //nolint:nilerr // Expected on shutdown.
+			if errors.Is(err, net.ErrClosed) {
+				return nil // Listener closed during shutdown.
+			}
+			return fmt.Errorf("vsock accept failed: %w", err)
 		}
```

This propagates unexpected errors through the errgroup, which cancels the context and triggers clean shutdown of all goroutines. The error is then logged at line 239.

---

I2. **Registry-based VM discovery is one-shot; fails when no WSL2 VM exists** — `hostswitch_windows.go:306-333` [Gemini 3.1 Pro] (important, regression)

```go
func getVMGUID(ctx context.Context, logger logr.Logger) (hvsock.GUID, error) {
	key, err := registry.OpenKey(
		registry.LOCAL_MACHINE,
		`SOFTWARE\...\ComputeSystem`,
		registry.ENUMERATE_SUB_KEYS)
	if err != nil {
		return hvsock.GUIDZero, fmt.Errorf("cannot open registry key ...: %w", err)
	}
	names, err := key.ReadSubKeyNames(0)
	// ...
	if len(names) == 0 {
		return hvsock.GUIDZero, errors.New("no running Hyper-V VMs found")
	}
```

`getVMGUID` reads the Hyper-V registry once and launches handshake goroutines for each GUID found. If the registry key is absent or contains zero entries — which happens on a fresh system where no other WSL2 distro is running — the function returns an error immediately. Because `startHostSwitch` runs before the hostagent boots the WSL2 VM (by design, since the guest blocks on the handshake), the utility VM may not exist yet.

When another WSL2 distro is already running, the shared utility VM's GUID is in the registry, and `attemptHandshake` polls it every second until the opensuse distro's `network-setup` starts listening. The bug manifests only when no WSL2 VM exists at startup time.

Fix: Retry registry reads within the handshake timeout loop, spawning handshake goroutines for newly discovered GUIDs as they appear.

```diff
 func getVMGUID(ctx context.Context, logger logr.Logger) (hvsock.GUID, error) {
-	key, err := registry.OpenKey(...)
-	// ... one-shot read
+	ticker := time.NewTicker(time.Second)
+	defer ticker.Stop()
+
+	seen := make(map[hvsock.GUID]bool)
+	found := make(chan hvsock.GUID, 1)
+
+	for {
+		select {
+		case <-ctx.Done():
+			return hvsock.GUIDZero, fmt.Errorf("VM GUID discovery timed out: %w", ctx.Err())
+		case vmGUID := <-found:
+			return vmGUID, nil
+		case <-ticker.C:
+			key, err := registry.OpenKey(...)
+			if err != nil {
+				continue // VM not running yet
+			}
+			names, _ := key.ReadSubKeyNames(0)
+			key.Close()
+			for _, name := range names {
+				vmGUID, err := hvsock.GUIDFromString(name)
+				if err == nil && !seen[vmGUID] {
+					seen[vmGUID] = true
+					go attemptHandshake(ctx, logger, vmGUID, found)
+				}
+			}
+		}
+	}
 }
```

This merges the current `getVMGUID` and `attemptHandshake` polling into a single retry loop: the registry is re-scanned each tick, and handshake goroutines are spawned only for newly discovered GUIDs. The existing 5-minute `handshakeTimeout` bounds the total wait.

---

## Suggestions

S1. **VM discovery cannot distinguish multiple opensuse instances** — `hostswitch_windows.go:306-398` [Claude Opus 4.6, Codex GPT 5.4] (suggestion, gap)

```go
if sig == signaturePhrase {
	logger.V(1).Info("Signature matched", "vmGUID", vmGUID.String())
	found <- vmGUID
	return
}
```

`getVMGUID` matches the first VM that sends the fixed `signaturePhrase` (line 98). If two WSL2 VMs run the opensuse distro simultaneously, two host-switch goroutines could bind to the same guest while the other gets no networking.

In the current architecture (one App, one LimaVM, one WSL2 instance), this cannot happen. But the code's structure — one host-switch per instance name, map-based state tracking — implies multi-instance support.

Fix: Add a comment documenting the single-instance constraint.

```diff
 // getVMGUID enumerates running Hyper-V VMs from the registry and handshakes
-// with each in parallel to find the one running our distro.
+// with each in parallel to find the one running our distro. This assumes
+// only one opensuse WSL2 instance runs at a time; with multiple instances,
+// the signature-based discovery would be ambiguous.
```

S2. **`startHostSwitch` inline cancel-wait duplicates `stopHostSwitch`** — `hostswitch_windows.go:114-122` [Gemini 3.1 Pro] (suggestion, regression)

```go
r.hostSwitchMu.Lock()
if old, exists := r.hostSwitchStates[name]; exists {
	r.hostSwitchMu.Unlock()
	old.cancel()
	<-old.done
	r.hostSwitchMu.Lock()
}
```

The inline cancel-and-wait duplicates `stopHostSwitch`'s logic. The unlock-wait-relock pattern also opens a narrow window where a concurrent `shutdownAllHostagents` call could delete the old entry from the map, wait on it, and then `startHostSwitch` would insert a new running entry — leaving a host-switch the manager believes it stopped. In practice this requires a shutdown racing with a startup, which is extremely unlikely.

Fix: Delegate to the existing `stopHostSwitch` method.

```diff
-	r.hostSwitchMu.Lock()
-	if old, exists := r.hostSwitchStates[name]; exists {
-		r.hostSwitchMu.Unlock()
-		old.cancel()
-		<-old.done
-		r.hostSwitchMu.Lock()
-	}
+	r.stopHostSwitch(name)
+
+	r.hostSwitchMu.Lock()
 	r.hostSwitchStates[name] = &hostSwitchState{
```

S3. **`signalListenerReady` leaks goroutine when `hvsock.Dial` hangs** — `hostswitch_windows.go:319-337` [Codex GPT 5.4, Gemini 3.1 Pro] (suggestion, gap)

```go
go func() {
	conn, err := getVsockConnection(vmGUID, vsockHandshakePort)
	// ...
}()
select {
case r := <-ch:
	return r.err
case <-ctx.Done():
	return ctx.Err()
}
```

The comment at lines 316-318 says the `select` "prevents an indefinite hang," but it prevents only the caller from hanging. If `hvsock.Dial` blocks forever, the background goroutine leaks.
This is a known limitation of the hvsock library (no cancelable dial), already documented in the code. The goroutine typically leaks only during process shutdown, when it is about to be cleaned up anyway.

Fix: Correct the comment to avoid implying the goroutine is canceled.

```diff
 // signalListenerReady tells the guest that the data listener is ready.
-// The dial is wrapped in a goroutine because hvsock.Dial does not accept
-// a context; the select on ctx.Done prevents an indefinite hang if the
-// VM becomes unresponsive between discovery and signaling.
+// The dial is wrapped in a goroutine because hvsock.Dial does not accept
+// a context. The select on ctx.Done prevents the caller from hanging, but
+// the dial goroutine itself may leak if the VM becomes unresponsive.
```

S4. **Platform-independent logic locked behind Windows build constraint** — `hostswitch_windows_test.go` [Claude Opus 4.6] (suggestion, gap)

`validateSubnet` and `newVirtualNetworkConfig` use only `net.ParseCIDR` and `net.IPv4` — no Windows-specific APIs. Because they reside in `hostswitch_windows.go`, the tests run only on Windows CI. Extracting these functions and their types into a `hostswitch_config.go` (no build constraint) would let the tests run on all platforms.

S5. **`AcceptStdio` blocks accept loop, serializing connections** — `hostswitch_windows.go:205` [Claude Opus 4.6] (suggestion, enhancement)

`AcceptStdio` blocks until the connection closes, so the accept goroutine handles one connection at a time. This is correct because each VM runs a single `vm-switch` process — reconnections are serial (old connection EOF, then accept new connection). A clarifying comment would prevent a future contributor from "fixing" this with `go vn.AcceptStdio(...)`.

```diff
+		// AcceptStdio blocks until the connection closes, so the loop
+		// handles one connection at a time. This is correct: each VM
+		// runs a single vm-switch process, and reconnections are serial.
 		if err := vn.AcceptStdio(ctx, conn); err != nil {
```

S6. **`attemptHandshake` delays first probe by 1 second** — `hostswitch_windows.go:276-281` [Gemini 3.1 Pro] (suggestion, gap)

`time.NewTicker` fires its first event after the specified duration, so the first handshake probe is delayed by 1 second. Adding an immediate attempt before entering the ticker loop would shave a second off networking availability on every boot.

S7. **Networking doc overstates Phase 1 identification** — `docs/design/networking.md:44-64` [Codex GPT 5.4] (suggestion, regression)

The doc describes the signature phrase as sufficient to identify "the correct" VM, which holds only while at most one matching opensuse guest is running. The text should document this single-instance constraint, consistent with S1.

S8. **`validateSubnet` ignores CIDR mask** — `hostswitch_windows.go:68-73` [Gemini 3.1 Pro] (suggestion, gap)

`validateSubnet` hardcodes the fourth octet to `.1`, `.2`, and `.254` without checking the mask. For a `/25` subnet, `.254` would fall outside the valid range. Since the function is called only with the hardcoded `defaultSubnet = "192.168.127.0/24"`, this cannot happen today. A defensive assertion (`ones != 24 → error`) would guard against future misuse.

---

## Design Observations

### Concerns

1. **(future)** The discovery protocol needs per-instance identity, not a product-wide signature string. Moving to an instance token or GUID mapping would eliminate the class of cross-instance misbinding bugs behind S1 and make the "one goroutine per instance" contract enforceable. [Codex GPT 5.4, Gemini 3.1 Pro]

2. **(future)** The host-switch should integrate with the reconciler's state machine rather than running as an unobserved side goroutine. A readiness/health channel between `startInstance` and `runHostSwitch` would make startup sequencing explicit and let the reconciler handle host-switch failures the same way it handles hostagent exits.
The author acknowledges this in a code comment (see Acknowledged Limitations). [Codex GPT 5.4]

### Strengths

- **Thorough lifecycle integration.** Every code path that stops a hostagent also stops the host-switch: `handleDeletion` (line 40), `handleWatchedState` crash path (line 213), stopped-not-shouldRun path (line 268), `handleRestartNeeded` (line 334), `startInstance` failure paths (lines 458, 468), `stopInstance` (line 506), and `shutdownAllHostagents` (line 543). All eight call sites are correctly paired. [Claude Opus 4.6, Codex GPT 5.4]

- **Errgroup structure avoids goroutine leaks.** The API listener is set up before launching errgroup goroutines. The listener-close goroutines ensure `Accept` and `Serve` unblock on context cancellation — a well-designed shutdown cascade. [Claude Opus 4.6]

- **Clean platform isolation via `hostSwitchPlatform` embedding.** No `if runtime.GOOS == "windows"` checks in the controller; the compiler ensures all methods exist on all platforms. [Claude Opus 4.6, Codex GPT 5.4, Gemini 3.1 Pro]

- **Design document matches implementation.** Every constant, port, MAC address, and subnet in `networking.md` matches the code. Lifecycle sequence diagrams correctly reflect the implementation. [Claude Opus 4.6]

---

## Testing Assessment

1. **No integration test for host-switch lifecycle** — The host-switch requires a running WSL2 VM, which CI cannot provide. Unit tests cover `validateSubnet` and `newVirtualNetworkConfig`, which are the testable pure-logic components. This is a reasonable trade-off.

2. **Handshake logic untested** — `getVMGUID`, `attemptHandshake`, `readSignature`, and `signalListenerReady` depend on vsock and Windows registry, making them impractical to test without a real VM.

3. **No coverage for concurrent start/stop** — No test exercises a host-switch surviving concurrent start and stop events, though this is difficult to trigger in practice (reconciles are serialized).
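
As an illustration of the pure-logic checks that item 1 calls testable, the sketch below parses a subnet and derives fixed host addresses from it, including the `/24` assertion suggested in S8. It depends only on the `net` package, so it would build on any platform. Everything named here (`subnetAddrs`, `deriveSubnet`, the field names) is hypothetical; the PR's `validateSubnet` may be structured differently, and this review does not state what roles the `.1`, `.2`, and `.254` addresses play.

```go
package main

import (
	"fmt"
	"net"
)

// subnetAddrs is a hypothetical result type; the PR's actual type differs.
type subnetAddrs struct {
	First, Second, Last net.IP // the .1, .2, and .254 hosts
}

// deriveSubnet parses cidr and returns the .1, .2, and .254 addresses.
// It rejects anything other than an IPv4 /24, so the hardcoded fourth
// octets can never fall outside the network.
func deriveSubnet(cidr string) (subnetAddrs, error) {
	_, ipNet, err := net.ParseCIDR(cidr)
	if err != nil {
		return subnetAddrs{}, fmt.Errorf("invalid subnet %q: %w", cidr, err)
	}
	base := ipNet.IP.To4()
	ones, bits := ipNet.Mask.Size()
	if base == nil || ones != 24 || bits != 32 {
		return subnetAddrs{}, fmt.Errorf("subnet %q must be an IPv4 /24", cidr)
	}
	host := func(last byte) net.IP {
		return net.IPv4(base[0], base[1], base[2], last)
	}
	return subnetAddrs{First: host(1), Second: host(2), Last: host(254)}, nil
}

func main() {
	for _, cidr := range []string{"192.168.127.0/24", "192.168.127.0/25", "not-a-cidr"} {
		addrs, err := deriveSubnet(cidr)
		if err != nil {
			fmt.Println("rejected:", err)
			continue
		}
		fmt.Println("accepted:", cidr, addrs.First, addrs.Second, addrs.Last)
	}
}
```

A table-driven test over the same three inputs would cover the default subnet, the mask check, and the parse-error path without needing a VM.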

---

## Documentation Assessment

- `networking.md` is thorough and well-structured: protocol constants, lifecycle sequences, and architecture are all documented.
- The `api_lima.md` cross-reference at line 327 correctly points to the new doc.
- S7 notes that the Phase 1 description should document the single-instance constraint.

---

## Commit Structure

The two commits are well-scoped. The CI workflow fix (`d498946`) is independent of the host-switch feature and could have been a separate PR, but bundling it is acceptable given its small size. The PR body clearly separates the two changes with a horizontal rule.

---

## Acknowledged Limitations

1. **Host-switch failure invisible to reconciler** — `hostswitch_windows.go:152-157`: "If the goroutine exits due to an error (not context cancellation), the controller is not notified: the guest loses DHCP/DNS/NAT and must be restarted manually. Integrating host-switch health into the controller's state machine (enqueue a reconcile on unexpected exit) would allow automatic recovery but requires non-trivial plumbing." This limitation becomes more relevant once I2 is fixed (retry loop means the goroutine can fail later, not just at startup).

2. **`hvsock.Dial` does not accept a context** — `hostswitch_windows.go:316-318`: The goroutine wrapping `hvsock.Dial` may leak if the dial blocks indefinitely. The process is typically exiting when this happens.

3. **Sequential shutdown** — `limavm_controller.go:508` (pre-existing): `shutdownAllHostagents` waits on each hostagent sequentially. Adding `stopHostSwitch(name)` at line 543 adds to per-instance shutdown time, but the stop is fast (context cancel → goroutine exits).

---

## Agent Performance Retro

### [Claude]

Claude produced the most thorough analysis of the lifecycle integration, tracing all eight `startHostSwitch`/`stopHostSwitch` call sites and confirming each was correctly paired.
It uniquely investigated the `AcceptStdio` blocking behavior by reading the upstream gvisor-tap-vsock source (`tap/switch.go`), providing concrete evidence for S5. Its coverage was complete — every file reviewed with detailed notes. It used `git blame` four times to verify regression attribution.

Claude missed the one-shot registry read bug (I2) despite reading the `getVMGUID` function. This is the most significant coverage gap: the timing between `startHostSwitch` and the hostagent starting the VM is the central design constraint, and Claude acknowledged the ordering without questioning whether the registry would contain entries at that point.

### [Codex]

Codex provided the deepest analysis of the startup sequencing problem, identifying that host-switch failures are invisible to the reconciler (moved to Acknowledged Limitations since the author documented this). It uniquely raised the multi-instance ambiguity as an important issue (downgraded to S1 since the current architecture supports only one WSL2 instance).

Codex ran `go test` and fetched PR review comments via `gh api`, showing initiative in gathering evidence. It used `git blame` ten times — the most of any agent.

Codex also missed the one-shot registry bug. Like Claude, it focused on what happens after the handshake rather than whether the handshake's prerequisites are met.

### [Gemini]

Gemini uniquely identified the one-shot registry read bug (I2, originally rated C1). This was the most important unique finding across all agents: neither Claude nor Codex caught the timing issue despite reading the same code. Gemini also identified the `startHostSwitch` race condition (S2) and the `validateSubnet` mask issue (S8).

However, Gemini's tool usage was minimal (7 calls vs. 39 for Claude and 47 for Codex). It never ran `git blame`, so its regression classifications lack blame evidence.
The `signalListenerReady` leak finding (S3) and the `AcceptStdio` serialization concern (S5) overlap with Claude and Codex but without the upstream source verification that Claude performed.

Gemini rated the registry bug as CRITICAL. The consolidated report downgrades it to IMPORTANT because the bug manifests only when no other WSL2 distro is running; on most developer machines, WSL2 is already active, keeping the utility VM's GUID in the registry.

**Tool call highlights**

- **Claude**: 4 `git blame` calls, verified upstream `AcceptStdio` behavior in `gvisor-tap-vsock` source. Thorough context-building.
- **Codex**: 10 `git blame` calls, ran `go test`, fetched PR comments via `gh api`. Most comprehensive tool usage.
- **Gemini**: No `git blame`, only 7 total tool calls. Unable to verify regression classifications. Read only `limavm_lifecycle.go` in full; other files accessed only via grep.

### Summary

| | Claude Opus 4.6 | Codex GPT 5.4 | Gemini 3.1 Pro |
|---|---|---|---|
| Duration | 482s | 410s | 261s |
| Findings | 0C 1I 3S | 0C 0I 3S | 0C 2I 3S |
| Tool calls | 39 (Read 23, Bash 11) | 47 (exec_command 45) | 7 (grep_search 5) |
| Design observations | 4 strengths | 2 concerns, 2 strengths | 1 concern, 1 strength |
| False positives | 0 | 0 | 0 |
| Unique insights | S5 (AcceptStdio blocking) | Host-switch health (→ ack'd limitation) | I2 (one-shot registry), S2, S8 |
| Files reviewed | 11/11 | 11/11 | 11/11 |
| Coverage misses | Missed I2 | Missed I1, I2 | 0 |

**Totals**

| | Claude Opus 4.6 | Codex GPT 5.4 | Gemini 3.1 Pro |
|---|---|---|---|
| Totals | 0C 1I 3S | 0C 0I 3S | 0C 2I 3S |
| Downgraded | 0 | 2 (I1→ack'd, I2→S1) | 1 (C1→I2) |
| Dropped | 0 | 0 | 0 |

**Reconciliation**: Gemini C1 (one-shot registry) → consolidated I2 (downgraded from critical to important: only triggers when no other WSL2 distro is running). Codex I1 (startup failure invisible) → acknowledged limitation (author already documented this in code comment at `hostswitch_windows.go:152-157`).
Codex I2 (multi-instance ambiguity) → consolidated S1 (downgraded from important to suggestion: current architecture supports only one WSL2 instance).

---

## Review Process Notes

No suggestions for this round. The prompt guided all three agents to the accept-loop error handling issue independently. The one-shot registry bug was caught by the agent that focused more on timing constraints than on code tracing — a reminder that different analytical styles complement each other.

---

## Resolution

| | |
|---|---|
| **Addressed** | 2026-04-08 |
| **Commit** | `4c4439f Add host-switch virtual network for WSL2 instances` |

| # | Finding | Action |
|---|---------|--------|
| 1 | Important #1: Accept loop swallows unexpected errors | Fixed |
| 2 | Important #2: Registry-based VM discovery is one-shot | Fixed |
| 3 | Suggestion #1: VM discovery cannot distinguish multiple opensuse instances | Commented |
| 4 | Suggestion #2: startHostSwitch inline cancel-wait duplicates stopHostSwitch | Fixed |
| 5 | Suggestion #3: signalListenerReady leaks goroutine when hvsock.Dial hangs | Fixed |
| 6 | Suggestion #4: Platform-independent logic locked behind Windows build constraint | Skipped |
| 7 | Suggestion #5: AcceptStdio blocks accept loop, serializing connections | Commented |
| 8 | Suggestion #6: attemptHandshake delays first probe by 1 second | Fixed |
| 9 | Suggestion #7: Networking doc overstates Phase 1 identification | Fixed |
| 10 | Suggestion #8: validateSubnet ignores CIDR mask | Skipped |
| 11 | Testing Gap #1: No integration test for host-switch lifecycle | Skipped |
| 12 | Testing Gap #2: Handshake logic untested | Skipped |
| 13 | Testing Gap #3: No coverage for concurrent start/stop | Skipped |

---

## Appendix: Original Reviews

### [Claude Opus 4.6] — Pass 1

I've reviewed all changed files. Let me now write the complete review.

---

# Deep Code Review: PR #280 — Add host-switch virtual network for WSL2 instances

| Attribute | Value |
|-----------|-------|
| **PR** | #280 |
| **Commits** | 2 (`a29eda1`, `d498946`) |
| **Files changed** | 11 |
| **Verdict** | **Approve with minor fix** — one important issue in the accept loop error handling; everything else is solid. |

## Executive Summary

This PR ports the host-switch networking logic from rancher-desktop into the LimaVM controller as an in-process goroutine that provides DNS, DHCP, and NAT to WSL2 instances via gvisor-tap-vsock over AF_VSOCK. The lifecycle integration is thorough — every code path that starts, stops, restarts, crashes, or shuts down correctly pairs `startHostSwitch` and `stopHostSwitch`. A second commit fixes the CI workflow for cross-repository PRs. The main issue is that the vsock accept loop treats all errors as shutdown, silently degrading the host-switch when an unexpected error occurs.

## Findings

### Critical Issues

None.

### Important Issues

I1. **Accept loop swallows unexpected errors** — `hostswitch_windows.go:198-203` (important, regression)

```go
g.Go(func() error {
	for {
		conn, err := ln.Accept()
		if err != nil {
			// Listener closed (context cancelled).
			return nil //nolint:nilerr // Expected on shutdown.
		}
```

The goroutine returns `nil` for **all** `Accept` failures, not just the expected `net.ErrClosed` from shutdown. If `Accept` fails for an unexpected reason (e.g., resource exhaustion, OS-level socket error), the goroutine exits silently: the errgroup does not cancel, the other goroutines keep running, and the host-switch enters a degraded state where no new vsock connections can be accepted. The guest retains its existing connection but cannot reconnect after a network restart. The only external symptom is the "Host-switch stopped" log at line 241 — no error is surfaced.

Fix: Distinguish shutdown from unexpected errors:

```diff
 		conn, err := ln.Accept()
 		if err != nil {
-			// Listener closed (context cancelled).
-			return nil //nolint:nilerr // Expected on shutdown.
+			if errors.Is(err, net.ErrClosed) {
+				return nil // Listener closed during shutdown.
+			}
+			return fmt.Errorf("vsock accept failed: %w", err)
 		}
```

This propagates unexpected errors through the errgroup, which cancels the context and triggers a clean shutdown of all goroutines. The error is then logged at line 239.

### Suggestions

S1. **VM discovery cannot distinguish multiple opensuse instances** — `hostswitch_windows.go:306-358` (suggestion, gap)

```go
func getVMGUID(ctx context.Context, logger logr.Logger) (hvsock.GUID, error) {
	// ... enumerates all running Hyper-V VMs and matches a fixed signature
```

`getVMGUID` at line 306 matches the first VM that sends the correct `signaturePhrase`. If two WSL2 VMs run the opensuse distro simultaneously, two host-switch goroutines race to discover VMs, and each may connect to the wrong one (or both to the same VM, leaving the other without networking). The `runHostSwitch` goroutine at line 158 receives no instance name to pass to the handshake logic.

In the current architecture (one App → one LimaVM → one WSL2 instance), this cannot happen. But the code's structure (one host-switch per instance name, map-based state tracking) implies multi-instance support. If the design ever requires multiple WSL2 instances, the handshake protocol needs an instance-specific identifier.

Fix: Not needed now, but a brief comment on the design constraint would help future readers:

```diff
 // getVMGUID enumerates running Hyper-V VMs from the registry and handshakes
-// with each in parallel to find the one running our distro.
+// with each in parallel to find the one running our distro. This assumes
+// only one opensuse WSL2 instance runs at a time; with multiple instances,
+// the signature-based discovery would be ambiguous.
```

S2. **Platform-independent logic locked behind Windows-only build constraint** — `hostswitch_windows_test.go` (suggestion, gap)

```go
func TestValidateSubnet(t *testing.T) {
	t.Run("default subnet", func(t *testing.T) {
		s, err := validateSubnet("192.168.127.0/24")
```

`validateSubnet` (line 68) and `newVirtualNetworkConfig` (line 246) use only `net.ParseCIDR` and `net.IPv4` — no Windows-specific APIs. Because they're defined in `hostswitch_windows.go` (implicit `//go:build windows`), these tests can only run on Windows CI runners. Extracting these two functions and the types they operate on (`hostSwitchSubnet`, `hostSwitchSubnet` constants) into a `hostswitch_config.go` file (no build constraint) would allow the tests to run on all platforms and in `go test ./...` locally.

S3. **`AcceptStdio` blocks the accept loop, serializing connections** — `hostswitch_windows.go:205` (suggestion, enhancement)

```go
if err := vn.AcceptStdio(ctx, conn); err != nil {
	logger.Error(err, "Failed to accept connection into virtual network")
```

`AcceptStdio` calls `Switch.Accept` which calls `rx()` and blocks until the connection closes or the context is cancelled. This means the accept goroutine handles one connection at a time. If the guest's `vm-switch` dies, the old `AcceptStdio` must detect the dead connection (read returns EOF) before the loop can accept the reconnection.

In practice this works because vsock EOF propagates immediately on process death. But a comment explaining why sequential handling is correct (one `vm-switch` per guest, reconnection is serial) would prevent a future contributor from "fixing" it with `go vn.AcceptStdio(...)`.

Fix: Add a clarifying comment:

```diff
+		// AcceptStdio blocks until the connection closes, so the loop
+		// handles one connection at a time. This is correct because
+		// each VM runs a single vm-switch process; reconnections are
+		// serial (old connection EOF → accept new connection).
 		if err := vn.AcceptStdio(ctx, conn); err != nil {
```

## Design Observations

### Strengths

- **Thorough lifecycle integration.** Every code path that stops a hostagent also stops the host-switch: `handleDeletion` (line 40), `handleWatchedState` crash path (line 213), stopped-not-shouldRun path (line 268), `handleRestartNeeded` (line 334), `startInstance` failure paths (lines 458, 468), `stopInstance` (line 506), and `shutdownAllHostagents` (line 543). I traced all eight call sites and confirmed each is paired correctly.

- **Errgroup structure avoids goroutine leaks.** Setting up the API listener at line 184 _before_ launching errgroup goroutines means early failures don't leak goroutines. The listener-close goroutines (lines 214-217, 219-221) ensure `Accept` and `Serve` unblock on context cancellation. This is a well-thought-out shutdown cascade.

- **Design document is accurate and detailed.** Every constant, port number, MAC address, and subnet in `networking.md` matches the code. The lifecycle sequence diagrams correctly reflect the implementation.

- **Platform-conditional embedding via `hostSwitchPlatform` struct.** The approach of embedding a platform-specific struct with no-op methods on non-Windows is clean — no `if runtime.GOOS == "windows"` checks scattered through the controller, and the compiler ensures all methods exist on all platforms.

- **`startHostSwitch` re-entry handling.** Lines 116-122 correctly cancel and wait for an old host-switch before starting a new one, preventing duplicate goroutines for the same instance.

## Testing Assessment

1. **No integration test for host-switch lifecycle** — The host-switch requires a running WSL2 VM, which CI likely can't provide. The unit tests cover `validateSubnet` and `newVirtualNetworkConfig`, which are the testable pure-logic components. This is a reasonable trade-off.

2. **Handshake logic untested** — `getVMGUID`, `attemptHandshake`, `readSignature`, and `signalListenerReady` have no unit tests.
   These depend on vsock and Windows registry, making them difficult to test without mocks or a real VM. The code is straightforward enough that manual testing is reasonable.

3. **Error paths in `runHostSwitch` untested** — The early-return paths (subnet validation, handshake failure, virtualnetwork creation failure, API listener failure) are not tested. These are simple error-return sequences that would require extensive mocking with little value.

## Documentation Assessment

- `networking.md` is thorough and well-structured. Protocol constants, lifecycle sequences, and architecture are all documented.
- The `api_lima.md` cross-reference at line 327 correctly points to the new doc.
- Code comments are accurate: the `//go:build` constraints, method doc comments, and inline explanations all match the implementation.

## Commit Structure

The two commits are well-scoped:

- `d498946` ("Fix go-mod-k8s-sync checkout for cross-repository PRs") is a standalone CI fix, independent of the host-switch feature. Could have been a separate PR, but bundling it is acceptable since it's small and self-contained.
- `a29eda1` ("Add host-switch virtual network for WSL2 instances") contains the entire feature: code, tests, design doc, image change, and dependency updates. Coherent as a single commit.

## Acknowledged Limitations

1. **"the controller is not notified"** — `hostswitch_windows.go:152-157`: If the goroutine exits due to a non-cancellation error, the guest loses networking silently. The comment proposes enqueuing a reconcile on unexpected exit as a future improvement.

2. **`hvsock.Dial` does not accept a context** — `hostswitch_windows.go:415-418`: The workaround of wrapping the dial in a goroutine with a select on `ctx.Done()` is explicitly documented. The goroutine may leak if the dial blocks indefinitely, but the process is typically exiting when this happens.

3. **Sequential shutdown** — `limavm_controller.go:508` (pre-existing TODO): `shutdownAllHostagents` waits on each hostagent sequentially. The addition of `stopHostSwitch(name)` at line 543 adds to the per-instance shutdown time (host-switch cancellation + wait). This compounds the existing sequential problem but doesn't materially worsen it since `stopHostSwitch` is fast (context cancel → goroutine exits).

## Unresolved Feedback

Both of Nino-K's review comments were addressed in the current code:

1. [Early return leaking errgroup goroutines](https://github.com/rancher-sandbox/rancher-desktop-daemon/pull/280#discussion_r3054525099) — Fixed by reordering: the API listener setup (line 184) now precedes the errgroup creation (line 195), so the early return at line 188 cannot leak goroutines.
2. [Vsock listener leak on `virtualnetwork.New` failure](https://github.com/rancher-sandbox/rancher-desktop-daemon/pull/280#discussion_r3054530759) — Fixed by adding `ln.Close()` at line 176 before returning.
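
Both fixes follow one general pattern: acquire every resource that can fail before starting any goroutine, and release what was already acquired on each early return. A minimal sketch of that pattern follows, with loopback TCP listeners standing in for the vsock and API listeners (an assumption) and a hypothetical helper name, `openAll`, that does not appear in the PR.

```go
package main

import (
	"fmt"
	"net"
)

// openAll opens one listener per address, in order. If any Listen call
// fails, it closes the listeners opened so far and returns the error, so
// the caller never has to clean up after a partial failure.
func openAll(addrs ...string) ([]net.Listener, error) {
	var lns []net.Listener
	for _, addr := range addrs {
		ln, err := net.Listen("tcp", addr)
		if err != nil {
			for _, opened := range lns {
				opened.Close() // release what was already acquired
			}
			return nil, fmt.Errorf("listen on %q: %w", addr, err)
		}
		lns = append(lns, ln)
	}
	return lns, nil
}

func main() {
	// Port 99999 is out of range, which forces the early-return path
	// after the first listener has already been opened.
	if _, err := openAll("127.0.0.1:0", "127.0.0.1:99999"); err != nil {
		fmt.Println("setup failed, nothing left open:", err)
	}

	lns, err := openAll("127.0.0.1:0", "127.0.0.1:0")
	if err != nil {
		fmt.Println("unexpected failure:", err)
		return
	}
	// Only at this point would goroutines that use the listeners start.
	fmt.Println("listeners ready:", len(lns))
	for _, ln := range lns {
		ln.Close()
	}
}
```

Because nothing is spawned until every listener exists, a failed setup has no goroutines to cancel and no listeners to track.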
## Coverage Summary

| File | Status |
|------|--------|
| `.github/workflows/go-mod-k8s-sync.yaml` | Reviewed, no issues |
| `bats/tests/32-app-controllers/app.bats` | Reviewed, no issues — timeout increase from 60s to 150s is appropriate for the larger opensuse image (~350MB vs finch rootfs) |
| `docs/design/api_lima.md` | Trivial — one-line cross-reference |
| `docs/design/networking.md` | Reviewed, no issues — verified all constants match code |
| `go.mod` | Reviewed, no issues — three deps promoted from indirect to direct, matching new imports |
| `pkg/controllers/app/app/lima-images-wsl2.yaml` | Reviewed, no issues — image URL and digest updated |
| `pkg/controllers/lima/limavm/controllers/hostswitch_other.go` | Reviewed, no issues |
| `pkg/controllers/lima/limavm/controllers/hostswitch_windows.go` | **I1, S1, S3** |
| `pkg/controllers/lima/limavm/controllers/hostswitch_windows_test.go` | **S2** |
| `pkg/controllers/lima/limavm/controllers/limavm_controller.go` | Reviewed, no issues — `hostSwitchPlatform` embedding and `initHostSwitch` / `stopHostSwitch` placement are correct |
| `pkg/controllers/lima/limavm/controllers/limavm_lifecycle.go` | Reviewed, no issues — all eight `stopHostSwitch`/`startHostSwitch` call sites are correctly placed in the lifecycle |

### [Codex GPT 5.4] — Pass 1

### Executive Summary

This PR adds an in-process Windows host-switch for WSL2 LimaVMs, swaps the WSL2 app image to the opensuse distro, documents the new networking path, and adjusts the `go-mod` sync workflow for fork PRs. The main risks are in the new host-switch lifecycle: startup failures are invisible to the reconciler and can leave a VM stuck in `Running=False/Starting`, and the vsock discovery protocol cannot uniquely bind a host-switch to a specific WSL2 instance when more than one matching guest is running.

### Findings

**Critical Issues**

None.

**Important Issues**

I1. 
**Host-switch startup can fail without ever failing the LimaVM startup** — [limavm_lifecycle.go](/private/tmp/review-codex-R8dWHa/pkg/controllers/lima/limavm/controllers/limavm_lifecycle.go):445-479 (important, regression)

```go
// Start the host-switch virtual network for WSL2 instances. This must
// happen before the hostagent starts, because the guest's
// network-setup.service performs a vsock handshake during early boot.
r.startHostSwitch(ctx, limaVM.Name, inst)

// Start hostagent in background.
haCmd := exec.CommandContext(ctx, rddPath, args...)
```

`startInstance()` launches the host-switch asynchronously at line 448 and then immediately starts the hostagent. If `runHostSwitch()` exits on `vsockHandshake()` / `virtualnetwork.New()` / `vn.Listen()` failure at [hostswitch_windows.go](/private/tmp/review-codex-R8dWHa/pkg/controllers/lima/limavm/controllers/hostswitch_windows.go):167-188, nothing propagates that failure back into reconciliation. The only watched path for a still-running hostagent is `phaseStarting && shouldRun`, which returns `ctrl.Result{}, nil` at [limavm_lifecycle.go](/private/tmp/review-codex-R8dWHa/pkg/controllers/lima/limavm/controllers/limavm_lifecycle.go):187-195 and waits for another watcher event. That event never arrives when the guest is blocked in the early-boot vsock handshake, so the LimaVM can sit in `ReasonStarting` indefinitely even though startup is already unrecoverable.

Fix: make host-switch readiness part of startup, not a fire-and-forget sidecar. `startHostSwitch()` should return a ready/error channel and `startInstance()` should wait for either readiness or failure before launching the hostagent, then enqueue a reconcile or set `Running=False/StartFailed` if the host-switch dies unexpectedly later.

```diff
-	r.startHostSwitch(ctx, limaVM.Name, inst)
+	hs, err := r.startHostSwitch(ctx, limaVM.Name, inst)
+	if err != nil {
+		_ = r.updateCondition(ctx, limaVM, ConditionRunning, metav1.ConditionFalse, ReasonStartFailed, err.Error())
+		return ctrl.Result{}, err
+	}
+	if err := hs.WaitReady(ctx); err != nil {
+		_ = r.updateCondition(ctx, limaVM, ConditionRunning, metav1.ConditionFalse, ReasonStartFailed, err.Error())
+		return ctrl.Result{}, err
+	}
```

I2. **The handshake cannot distinguish two opensuse WSL2 guests** — [hostswitch_windows.go](/private/tmp/review-codex-R8dWHa/pkg/controllers/lima/limavm/controllers/hostswitch_windows.go):391-394 (important, regression)

```go
if sig == signaturePhrase {
	logger.V(1).Info("Signature matched", "vmGUID", vmGUID.String())
	found <- vmGUID
	return
}
```

The controller starts one host-switch per instance at [hostswitch_windows.go](/private/tmp/review-codex-R8dWHa/pkg/controllers/lima/limavm/controllers/hostswitch_windows.go):108-130, but VM discovery picks the first guest that returns the global `signaturePhrase` from line 98. `getVMGUID()` returns the first matching GUID at lines 350-355, and `attemptHandshake()` has no instance-specific input at lines 364-398, so two WSL2 LimaVMs running the same opensuse image are indistinguishable. In that case, multiple host-switch goroutines can bind to the same guest while another guest never gets networking, which contradicts the “one host-switch goroutine for each WSL2 instance” design described in [networking.md](/private/tmp/review-codex-R8dWHa/docs/design/networking.md):7.

Fix: add a per-instance identity to the protocol. The guest should echo an instance UUID, distro name, or other unique token derived from the LimaVM, and `getVMGUID()` should match that token instead of a shared constant. If Windows exposes a reliable registry mapping from compute-system GUID to distro name, filtering there would also work.

```diff
-	signaturePhrase = "github.com/rancher-sandbox/rancher-desktop/src/go/networking"
+	// unique per LimaVM, persisted in instance config and echoed by network-setup
+	expectedHandshakeToken := inst.HostSwitchToken
 ...
-	if sig == signaturePhrase {
+	if sig == expectedHandshakeToken {
 		found <- vmGUID
 		return
 	}
```

**Suggestions**

S1. **`signalListenerReady()` still leaks a goroutine when `hvsock.Dial` hangs** — [hostswitch_windows.go](/private/tmp/review-codex-R8dWHa/pkg/controllers/lima/limavm/controllers/hostswitch_windows.go):419-437 (suggestion, regression)

```go
go func() {
	conn, err := getVsockConnection(vmGUID, vsockHandshakePort)
	...
}()

select {
case r := <-ch:
	return r.err
case <-ctx.Done():
	return ctx.Err()
}
```

The comment at lines 416-418 says the `select` “prevents an indefinite hang,” but it only prevents the caller from hanging. If `getVsockConnection()` blocks forever, the goroutine spawned at line 422 stays blocked forever too; returning on `ctx.Done()` at line 435 does not cancel that dial. Each failed startup can therefore leak another goroutine.

Fix: either switch to a cancelable dial path, or explicitly document this as an unavoidable upstream limitation and avoid spawning a fresh uncancelable goroutine on every retry. At minimum, correct the comment so it does not imply the goroutine is canceled.

S2. **The new networking doc overstates what Phase 1 identification guarantees** — [networking.md](/private/tmp/review-codex-R8dWHa/docs/design/networking.md):44-64 (suggestion, regression)

```md
Multiple Hyper-V VMs may be running. The host-switch identifies the correct one by exchanging a signature phrase.
...
The signature phrase is "github.com/rancher-sandbox/rancher-desktop/src/go/networking".
```

This text describes the signature phrase as sufficient to identify “the correct” VM, but the implementation only checks a shared constant at [hostswitch_windows.go](/private/tmp/review-codex-R8dWHa/pkg/controllers/lima/limavm/controllers/hostswitch_windows.go):98 and [hostswitch_windows.go](/private/tmp/review-codex-R8dWHa/pkg/controllers/lima/limavm/controllers/hostswitch_windows.go):391-394. That is true only while there is at most one matching opensuse guest. Once I2 is fixed, this doc should explain the real discriminator; until then it should document the current single-matching-guest limitation.

Fix: update the Phase 1 description to match the current implementation, or revise it together with the protocol change from I2.

### Design Observations

**Concerns**

1. `(in-scope)` The host-switch should be modeled as part of the LimaVM running-state machine instead of an unobserved side goroutine. A small readiness/health interface between `startInstance()` and `runHostSwitch()` would eliminate I1, make startup sequencing explicit, and let reconcile logic handle host-switch exits the same way it already handles hostagent exits.

2. `(future)` The discovery protocol needs a per-instance identity, not a product-wide signature string. Moving to an instance token or GUID mapping would remove the class of cross-instance misbinding bugs behind I2 and make the “one goroutine per instance” contract enforceable.

**Strengths**

1. The platform split via [hostswitch_other.go](/private/tmp/review-codex-R8dWHa/pkg/controllers/lima/limavm/controllers/hostswitch_other.go) keeps non-Windows behavior clean and avoids infecting the existing Unix path with WSL2-specific branches.

2. 
The stop/restart integration points in [limavm_lifecycle.go](/private/tmp/review-codex-R8dWHa/pkg/controllers/lima/limavm/controllers/limavm_lifecycle.go) and [limavm_controller.go](/private/tmp/review-codex-R8dWHa/pkg/controllers/lima/limavm/controllers/limavm_controller.go) are consistently wired so the host-switch is torn down during deletion, restart, explicit stop, and manager shutdown.

### Testing Assessment

1. No test covers a host-switch startup failure and verifies that the LimaVM moves to `ReasonStartFailed` instead of hanging in `ReasonStarting` (I1).

2. No test runs two WSL2 LimaVMs concurrently to prove that each host-switch binds to its own guest rather than the first guest that returns the shared signature (I2).

3. The new unit tests only cover subnet/config assembly; they do not exercise `runHostSwitch()`, handshake discovery, shutdown, or the uncancelable `signalListenerReady()` path (S1).

### Documentation Assessment

The new WSL2 networking doc is useful, but S2 should be addressed: [networking.md](/private/tmp/review-codex-R8dWHa/docs/design/networking.md):44-64 currently claims the signature phrase identifies the correct VM, which is stronger than the implementation actually guarantees.

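
The readiness contract proposed in I1, and the missing test for it, could be prototyped without vsock or a VM. This is a minimal sketch under stated assumptions: the `hostSwitch` handle, its `WaitReady` method, the injected `handshake` function, and the `startAndWait` helper are all hypothetical, since the current `startHostSwitch` returns nothing.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// hostSwitch is a hypothetical handle returned by startHostSwitch.
type hostSwitch struct {
	ready chan struct{} // closed once the handshake has succeeded
	done  chan error    // receives the goroutine's exit result
}

// startHostSwitch runs the injected handshake in the background and
// records whether it became ready or failed.
func startHostSwitch(ctx context.Context, handshake func(context.Context) error) *hostSwitch {
	hs := &hostSwitch{ready: make(chan struct{}), done: make(chan error, 1)}
	go func() {
		if err := handshake(ctx); err != nil {
			hs.done <- err
			return
		}
		close(hs.ready)
		<-ctx.Done() // stand-in for serving the virtual network
		hs.done <- nil
	}()
	return hs
}

// WaitReady blocks until the handshake succeeds, fails, or ctx ends.
func (hs *hostSwitch) WaitReady(ctx context.Context) error {
	select {
	case <-hs.ready:
		return nil
	case err := <-hs.done:
		if err == nil {
			err = errors.New("host-switch exited before the caller observed readiness")
		}
		return err
	case <-ctx.Done():
		return ctx.Err()
	}
}

// startAndWait is a demo helper: it starts a host-switch whose handshake
// succeeds or fails on demand and returns what WaitReady reports.
func startAndWait(fail bool) error {
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()
	hs := startHostSwitch(ctx, func(context.Context) error {
		if fail {
			return errors.New("vsock handshake failed")
		}
		return nil
	})
	return hs.WaitReady(ctx)
}

func main() {
	fmt.Println("success case:", startAndWait(false))
	fmt.Println("failure case:", startAndWait(true))
}
```

With a handle of this shape, `startInstance()` could map a non-nil `WaitReady` result to `ReasonStartFailed`, and a unit test could inject a failing handshake to cover the first gap listed in the Testing Assessment.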
### Coverage Summary

- [.github/workflows/go-mod-k8s-sync.yaml](/private/tmp/review-codex-R8dWHa/.github/workflows/go-mod-k8s-sync.yaml) — Reviewed, no issues
- [bats/tests/32-app-controllers/app.bats](/private/tmp/review-codex-R8dWHa/bats/tests/32-app-controllers/app.bats) — Reviewed, no issues
- [docs/design/api_lima.md](/private/tmp/review-codex-R8dWHa/docs/design/api_lima.md) — Reviewed, no issues
- [docs/design/networking.md](/private/tmp/review-codex-R8dWHa/docs/design/networking.md) — S2
- [go.mod](/private/tmp/review-codex-R8dWHa/go.mod) — Reviewed, no issues
- [pkg/controllers/app/app/lima-images-wsl2.yaml](/private/tmp/review-codex-R8dWHa/pkg/controllers/app/app/lima-images-wsl2.yaml) — Reviewed, no issues
- [pkg/controllers/lima/limavm/controllers/hostswitch_other.go](/private/tmp/review-codex-R8dWHa/pkg/controllers/lima/limavm/controllers/hostswitch_other.go) — Reviewed, no issues
- [pkg/controllers/lima/limavm/controllers/hostswitch_windows.go](/private/tmp/review-codex-R8dWHa/pkg/controllers/lima/limavm/controllers/hostswitch_windows.go) — I2, S1
- [pkg/controllers/lima/limavm/controllers/hostswitch_windows_test.go](/private/tmp/review-codex-R8dWHa/pkg/controllers/lima/limavm/controllers/hostswitch_windows_test.go) — Reviewed, no issues
- [pkg/controllers/lima/limavm/controllers/limavm_controller.go](/private/tmp/review-codex-R8dWHa/pkg/controllers/lima/limavm/controllers/limavm_controller.go) — Reviewed, no issues
- [pkg/controllers/lima/limavm/controllers/limavm_lifecycle.go](/private/tmp/review-codex-R8dWHa/pkg/controllers/lima/limavm/controllers/limavm_lifecycle.go) — I1

### [Gemini 3.1 Pro] — Pass 1

### Executive Summary

This PR introduces a host-switch virtual network for WSL2 instances, bridging DNS, DHCP, and NAT via AF_VSOCK instead of standard host networking. 
While the architectural approach is solid and well-isolated via build tags, the implementation contains a critical race condition: the controller attempts to discover the VM's identity before the VM has actually booted, causing the virtual network setup to abort immediately. The PR also introduces important edge cases involving concurrent controller state mutations and partial-failure modes in the vsock connection loop.

### Findings

#### Critical Issues

**C1. Host-switch fails immediately if no WSL2 VM is running** — `pkg/controllers/lima/limavm/controllers/hostswitch_windows.go:406-421`

```go
key, err := registry.OpenKey(
	registry.LOCAL_MACHINE,
	`SOFTWARE\Microsoft\Windows NT\CurrentVersion\HostComputeService\VolatileStore\ComputeSystem`,
	registry.ENUMERATE_SUB_KEYS)
if err != nil {
	return hvsock.GUIDZero, fmt.Errorf("cannot open registry key (is a WSL VM running?): %w", err)
}
names, err := key.ReadSubKeyNames(0)
key.Close()
if err != nil {
	return hvsock.GUIDZero, fmt.Errorf("cannot read VM IDs from registry: %w", err)
}
```

The reconciler calls `startHostSwitch` *before* starting the `hostagent` (which actually boots the WSL2 VM). When `startHostSwitch` spawns its background goroutine, it calls `getVMGUID` which reads the Hyper-V registry exactly once. Because the VM has not booted yet, its GUID will not be in the registry. `getVMGUID` will fail immediately, causing the host-switch goroutine to exit. When the guest eventually boots a few seconds later, its `network-setup.service` will deadlock waiting for the vsock handshake. (The only reason this might accidentally work during local testing is if the developer happens to have a different WSL2 distro already running, which keeps the shared utility VM alive).

**Fix**: Refactor `getVMGUID` to retry on a ticker until `ctx.Done()`, spawning handshake goroutines for newly discovered GUIDs as they appear in the registry.

```go
func getVMGUID(ctx context.Context, logger logr.Logger) (hvsock.GUID, error) {
	ticker := time.NewTicker(time.Second)
	defer ticker.Stop()

	seen := make(map[hvsock.GUID]bool)
	found := make(chan hvsock.GUID, 1)

	for {
		select {
		case <-ctx.Done():
			return hvsock.GUIDZero, fmt.Errorf("VM GUID discovery timed out: %w", ctx.Err())
		case vmGUID := <-found:
			return vmGUID, nil
		case <-ticker.C:
			key, err := registry.OpenKey(registry.LOCAL_MACHINE, `SOFTWARE\Microsoft\Windows NT\CurrentVersion\HostComputeService\VolatileStore\ComputeSystem`, registry.ENUMERATE_SUB_KEYS)
			if err != nil {
				continue // Wait for Hyper-V to create the VM
			}
			names, _ := key.ReadSubKeyNames(0)
			key.Close()

			for _, name := range names {
				vmGUID, err := hvsock.GUIDFromString(name)
				if err == nil && !seen[vmGUID] {
					seen[vmGUID] = true
					go attemptHandshake(ctx, logger, vmGUID, found)
				}
			}
		}
	}
}
```

#### Important Issues

**I1. Vsock accept loop swallows OS errors and leaks goroutines** — `pkg/controllers/lima/limavm/controllers/hostswitch_windows.go:226-229`

```go
conn, err := ln.Accept()
if err != nil {
	// Listener closed (context cancelled).
	return nil //nolint:nilerr // Expected on shutdown.
}
```

The accept loop assumes that any error from `ln.Accept()` is due to graceful shutdown (the listener being closed). If `Accept()` fails due to a temporary OS error (e.g., `ENOBUFS`, `EMFILE`), the goroutine exits with `nil`. Because it does not return a non-nil error, the `errgroup` does not cancel the shared context. This leaves the HTTP server and wait goroutines running indefinitely, permanently breaking the virtual network for this VM without notifying the controller.

**Fix**: Cancel the context or return an error if the parent context is not actively shutting down.

```diff
 conn, err := ln.Accept()
 if err != nil {
-	// Listener closed (context cancelled).
-	return nil //nolint:nilerr // Expected on shutdown.
+	if ctx.Err() != nil {
+		return nil
+	}
+	return fmt.Errorf("vsock accept failed: %w", err)
 }
```

**I2. 
`startHostSwitch` race condition leaves shutting-down state in map** — `pkg/controllers/lima/limavm/controllers/hostswitch_windows.go:114-120`

```go
r.hostSwitchMu.Lock()
if old, exists := r.hostSwitchStates[name]; exists {
	r.hostSwitchMu.Unlock()
	old.cancel()
	<-old.done
	r.hostSwitchMu.Lock()
}
```

This lock sandwich unlocks the mutex while leaving `old` in the `hostSwitchStates` map. If `shutdownAllHostagents` (or another concurrent caller) calls `stopHostSwitch` during the wait, it will find the shutting-down `old` state, delete it from the map, and wait. `startHostSwitch` will then finish waiting, lock the mutex, and blindly insert the *new* running state into the map—leaving a running host-switch that the manager believes it just successfully stopped.

**Fix**: Delegate to the existing safe `stopHostSwitch` method.

```diff
-	r.hostSwitchMu.Lock()
-	if old, exists := r.hostSwitchStates[name]; exists {
-		r.hostSwitchMu.Unlock()
-		old.cancel()
-		<-old.done
-		r.hostSwitchMu.Lock()
-	}
+	r.stopHostSwitch(name)
+
+	r.hostSwitchMu.Lock()
 	r.hostSwitchStates[name] = &hostSwitchState{
```

**I3. `validateSubnet` ignores CIDR mask and produces invalid IPs** — `pkg/controllers/lima/limavm/controllers/hostswitch_windows.go:68-73`

```go
ip, _, err := net.ParseCIDR(subnet)
// ...
ipv4 := ip.To4()
// ...
tapIP := net.IPv4(ipv4[0], ipv4[1], ipv4[2], 2).String()
```

`validateSubnet` parses the raw string but ignores the CIDR mask, hardcoding the fourth octet to `.1`, `.2`, and `.254`. If a user (or future configuration) provides a valid subnet that isn't a `/24` (e.g., `192.168.127.0/25`), the function will assign `192.168.127.254` as the `StaticDNSHost`, which is completely outside the subnet's valid IP range.

**Fix**: Explicitly assert that the mask is `/24`.

```diff
-	ip, _, err := net.ParseCIDR(subnet)
+	ip, ipNet, err := net.ParseCIDR(subnet)
 	if err != nil {
 		return nil, fmt.Errorf("invalid subnet %q: %w", subnet, err)
 	}
+	if ones, _ := ipNet.Mask.Size(); ones != 24 {
+		return nil, fmt.Errorf("subnet %q must be a /24", subnet)
+	}
```

#### Suggestions

**S1. `attemptHandshake` delays first connection attempt by 1 second** — `pkg/controllers/lima/limavm/controllers/hostswitch_windows.go:276-281` (suggestion, gap)

`time.Ticker` fires its first event *after* the specified duration. The initial loop iteration blocks on the select statement, artificially delaying the very first handshake probe by 1 second. Adding an immediate execution path ensures faster guest networking availability.

**S2. `signalListenerReady` leaks goroutine on hvsock dial hang** — `pkg/controllers/lima/limavm/controllers/hostswitch_windows.go:319-323` (suggestion, gap)

`hvsock.Dial` does not support context cancellation. If the VM's network stack hangs and the dial never completes, the background goroutine spawned in `signalListenerReady` will leak permanently (even after the parent context expires). While this is a known limitation of the underlying Windows AF_VSOCK libraries, it should be documented.

### Design Observations

**Concerns:**

- The architecture couples the controller to the guest's boot sequence via an explicit handshake and registry scrape. A more resilient design would have the host-switch blindly listen on a predefined vsock port, allowing the guest to simply dial it during boot. Eliminating the reverse discovery phase (and the `getVMGUID` registry loop) would significantly reduce the surface area for timing bugs. `(future)`

**Strengths:**

- Utilizing an interface/struct-embedding strategy (`hostSwitchPlatform`) across build constraints correctly isolates Windows-specific vsock logic from the main controller loop, ensuring non-Windows builds stay unpolluted.

### Testing Assessment

1. 
**Host-Switch Lifecycle:** The BATS tests verify the happy path (`ConditionCreated` success). There are no tests covering the lifecycle of the host-switch goroutine across partial boot failures, nor validation that the host-switch survives concurrent `start` and `stop` events.

2. **Crash Recovery:** There is no coverage for what happens if the guest's `network-setup.service` crashes and restarts. The gvisor-tap-vsock tunnel assumes a stable stream; reconnect handling is not validated.

### Documentation Assessment

The sequence diagrams in `docs/design/networking.md` provide an excellent, high-level overview of the vsock tunneling strategy. However, the documentation currently lacks troubleshooting guidelines. If a user encounters the "Vsock handshake failed" error, it is unclear whether they should restart the RDD service or force-kill the WSL VM.

### Coverage Summary

- `.github/workflows/go-mod-k8s-sync.yaml` — Reviewed, no issues
- `bats/tests/32-app-controllers/app.bats` — Reviewed, no issues
- `docs/design/api_lima.md` — Trivial
- `docs/design/networking.md` — Reviewed, no issues
- `go.mod` — Trivial
- `pkg/controllers/app/app/lima-images-wsl2.yaml` — Reviewed, no issues
- `pkg/controllers/lima/limavm/controllers/hostswitch_other.go` — Reviewed, no issues
- `pkg/controllers/lima/limavm/controllers/hostswitch_windows.go` — C1, I1, I2, I3, S1, S2
- `pkg/controllers/lima/limavm/controllers/hostswitch_windows_test.go` — Reviewed, no issues
- `pkg/controllers/lima/limavm/controllers/limavm_controller.go` — Reviewed, no issues
- `pkg/controllers/lima/limavm/controllers/limavm_lifecycle.go` — Reviewed, no issues
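
The accept-loop finding (I1 here, and I1 in the consolidated review) can be exercised without vsock by driving the loop through the `net.Listener` interface. The sketch below is illustrative only: `stubListener`, `acceptLoop`, `demoShutdown`, and `demoFailure` are hypothetical names, and it assumes the hvsock listener reports a closed listener as `net.ErrClosed`, which this review does not confirm.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net"
)

// stubListener is a hypothetical net.Listener whose Accept always fails
// with a preset error, so the error handling can be tested in isolation.
type stubListener struct{ err error }

func (s stubListener) Accept() (net.Conn, error) { return nil, s.err }
func (s stubListener) Close() error              { return nil }
func (s stubListener) Addr() net.Addr            { return nil }

// acceptLoop returns nil only for the two expected shutdown signals: a
// cancelled context or a closed listener. Any other Accept error is
// returned so that an errgroup would cancel its shared context.
func acceptLoop(ctx context.Context, ln net.Listener, handle func(net.Conn)) error {
	for {
		conn, err := ln.Accept()
		if err != nil {
			if ctx.Err() != nil || errors.Is(err, net.ErrClosed) {
				return nil // expected on shutdown
			}
			return fmt.Errorf("vsock accept failed: %w", err)
		}
		go handle(conn)
	}
}

// demoShutdown simulates Accept failing because the listener was closed.
func demoShutdown() error {
	return acceptLoop(context.Background(), stubListener{err: net.ErrClosed}, func(net.Conn) {})
}

// demoFailure simulates an unexpected Accept error while still running.
func demoFailure() error {
	return acceptLoop(context.Background(), stubListener{err: errors.New("too many open files")}, func(net.Conn) {})
}

func main() {
	fmt.Println("shutdown:", demoShutdown())
	fmt.Println("failure:", demoFailure())
}
```

Checking both `ctx.Err()` and `net.ErrClosed` covers the case where the listener is closed slightly before or after the context is observed as cancelled; whether both checks are needed depends on how the real listener is torn down.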