Deep Review: 20260408-145125-pr-280

Date2026-04-08 14:51
Reporancher-sandbox/rancher-desktop-daemon
Round1
Author@jandubois
PR#280 — Add host-switch virtual network for WSL2 instances
Branchhost-switch
Commits a53d4b9 Add host-switch virtual network for WSL2 instances
ReviewersClaude Opus 4.6, Codex GPT 5.4, Gemini 3.1 Pro
VerdictMerge with fixes — lifecycle integration gaps (I1-I2) and a concurrency bug (I3) need attention; the core networking logic is sound
Wall-clock time33 min 29 s

Executive Summary

This PR ports the host-switch networking module into the LimaVM controller as an in-process goroutine, enabling WSL2 instances to get network connectivity via a gvisor-tap-vsock virtual L2 network over AF_VSOCK. The implementation is well-structured with clean platform separation and thorough design documentation. The key findings involve lifecycle integration: the host-switch goroutine can leak on restart and startup failure paths, and the vsock discovery loop spawns unbounded detached goroutines that risk a panic on channel close. All issues have straightforward fixes.


Critical Issues

None.


Important Issues

I1. Missing stopHostSwitch in handleRestartNeeded leaks goroutine on restart Claude

Every other shutdown path calls stopHostSwitch alongside stopWatcher: stopInstance at line 502, handleDeletion at line 40, handleWatchedState crash recovery at line 213, and shutdownAllHostagents at line 547. The handleRestartNeeded(phaseRunning) branch is the only path that stops the watcher without stopping the host-switch. On the next reconcile, startHostSwitch overwrites the map entry (line 58) without cancelling the old goroutine, leaking a vsock listener, a virtual network, and the errgroup goroutines.

While handleRestartNeeded predates this PR (confirmed via git blame), the lifecycle pattern it should follow was introduced by this PR.

func (r *LimaVMReconciler) handleRestartNeeded(ctx context.Context, limaVM *v1alpha1.LimaVM, phase instancePhase) (ctrl.Result, error) {
	logger := log.FromContext(ctx)

	switch phase {
	case phaseRunning:
		logger.Info("Restart needed: stopping running instance")
		r.shutdownHostagent(ctx, limaVM.Name, nil)
		r.stopWatcher(limaVM.Name)

		// Clear restartNeeded and set Stopped condition in one write.
		// This is inlined (rather than calling stopInstance) so both changes
		// land in a single patch — stopInstance's updateCondition would take
		// its own DeepCopy and miss the RestartNeeded change.
		patch := client.MergeFrom(limaVM.DeepCopy())
		limaVM.Status.RestartNeeded = false
		r.setCondition(limaVM, ConditionRunning, metav1.ConditionFalse, ReasonStopped, "Stopped for restart")
		return ctrl.Result{}, r.Status().Patch(ctx, limaVM, patch)

Fix:

 	case phaseRunning:
 		logger.Info("Restart needed: stopping running instance")
 		r.shutdownHostagent(ctx, limaVM.Name, nil)
 		r.stopWatcher(limaVM.Name)
+		r.stopHostSwitch(limaVM.Name)
I2. startHostSwitch overwrites state without cancelling existing goroutine; leaked on startInstance failure Claude Codex Gemini

startHostSwitch unconditionally overwrites hostSwitchStates[name] at line 58. If the map already has an entry for this instance (from a previous call), the old goroutine's cancel function is discarded, and the goroutine runs until the manager shuts down.

This manifests through two paths. First, startInstance calls startHostSwitch at line 446, but neither the haCmd.Start() failure at line 454 nor the waitForPIDFile failure at line 463 calls stopHostSwitch. On retry, startHostSwitch orphans the previous goroutine. Second, any code path that calls startHostSwitch twice for the same instance (e.g., I1's restart path) leaks the first goroutine.

// startHostSwitch launches the host-switch goroutine for a WSL2 instance.
// It must be called before the hostagent starts, because the guest's
// network-setup.service performs a vsock handshake during early boot.
func (r *LimaVMReconciler) startHostSwitch(ctx context.Context, name string, inst *limatype.Instance) {
	if inst.VMType != limatype.WSL2 {
		return
	}

	hsCtx, hsCancel := context.WithCancel(ctx)
	done := make(chan struct{})

	r.hostSwitchMu.Lock()
	r.hostSwitchStates[name] = &hostSwitchState{
		cancel: hsCancel,
		done:   done,
	}
	r.hostSwitchMu.Unlock()

	logger := logr.FromContextOrDiscard(ctx).WithValues("instance", name, "component", "host-switch")
	go r.runHostSwitch(hsCtx, logger, done)
}
	// Start the host-switch virtual network for WSL2 instances. This must
	// happen before the hostagent starts, because the guest's
	// network-setup.service performs a vsock handshake during early boot.
	r.startHostSwitch(ctx, limaVM.Name, inst)

	// Start hostagent in background.
	haCmd := exec.CommandContext(ctx, rddPath, args...)
	process.SetGroup(haCmd)
	haCmd.Stdout = haStdoutW
	haCmd.Stderr = haStderrW

	if err := haCmd.Start(); err != nil {
		logger.Error(err, "Failed to start hostagent")
		if updateErr := r.updateCondition(ctx, limaVM, ConditionRunning, metav1.ConditionFalse, ReasonStartFailed, err.Error()); updateErr != nil {
			logger.Error(updateErr, "Failed to update status after hostagent start failure")
		}
		return ctrl.Result{}, err
	}

	// Wait for PID file to be created (indicates hostagent has started)
	if err := r.waitForPIDFile(haPIDPath, haStderrPath); err != nil {
		logger.Error(err, "Hostagent did not start")
		if updateErr := r.updateCondition(ctx, limaVM, ConditionRunning, metav1.ConditionFalse, ReasonStartFailed, err.Error()); updateErr != nil {
			logger.Error(updateErr, "Failed to update status after hostagent startup failure")
		}
		return ctrl.Result{}, err
	}

Fix (two parts): Make startHostSwitch idempotent — stop any existing goroutine before starting a new one:

 func (r *LimaVMReconciler) startHostSwitch(ctx context.Context, name string, inst *limatype.Instance) {
 	if inst.VMType != limatype.WSL2 {
 		return
 	}

+	r.stopHostSwitch(name)
+
 	hsCtx, hsCancel := context.WithCancel(ctx)

Add cleanup on startInstance error paths after line 446:

 	if err := haCmd.Start(); err != nil {
 		logger.Error(err, "Failed to start hostagent")
+		r.stopHostSwitch(limaVM.Name)
 		...
 		return ctrl.Result{}, err
 	}

 	if err := r.waitForPIDFile(haPIDPath, haStderrPath); err != nil {
 		logger.Error(err, "Hostagent did not start")
+		r.stopHostSwitch(limaVM.Name)
 		...
 		return ctrl.Result{}, err
 	}
I3. Unbounded detached goroutines in attemptHandshake with potential panic on channel close Codex Gemini

attemptHandshake spawns a new detached goroutine every second (line 299). Each goroutine blocks in hvsock.Dial, which accepts no context and has no built-in timeout. On a host with several unresponsive Hyper-V VMs, goroutines accumulate at one per GUID per second.

Worse, these detached goroutines outlive the WaitGroup. When getVMGUID receives a match from found and returns, defer cancel() fires, all attemptHandshake functions return, wg.Wait() completes, and close(found) executes. A detached goroutine that completed its dial before cancellation but reaches the found <- vmGUID send after close(found) triggers a panic. The select { ... default: } does not protect against sends on a closed channel.

Additionally, readSignature (line 331) and signalListenerReady (line 340) perform I/O without deadlines. A peer that connects but sends incomplete data blocks the goroutine indefinitely.

func attemptHandshake(ctx context.Context, logger logr.Logger, vmGUID hvsock.GUID, found chan<- hvsock.GUID) {
	ticker := time.NewTicker(time.Second)
	defer ticker.Stop()

	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			// Dial in a goroutine to avoid blocking on hvsock.Dial timeout.
			go func() {
				conn, err := getVsockConnection(vmGUID, vsockHandshakePort)
				if ctx.Err() != nil {
					if conn != nil {
						conn.Close()
					}
					return
				}
				if err != nil {
					logger.V(1).Info("Handshake dial failed", "vmGUID", vmGUID.String(), "error", err)
					return
				}

				sig, err := readSignature(conn)
				conn.Close()
				if err != nil {
					logger.V(1).Info("Handshake read failed", "vmGUID", vmGUID.String(), "error", err)
					return
				}
				if sig == signaturePhrase {
					logger.V(1).Info("Signature matched", "vmGUID", vmGUID.String())
					select {
					case found <- vmGUID:
					default:
					}
				}
			}()
		}
	}
}
// readSignature reads the signature phrase from the peer.
func readSignature(conn net.Conn) (string, error) {
	buf := make([]byte, len(signaturePhrase))
	if _, err := io.ReadFull(conn, buf); err != nil {
		return "", err
	}
	return string(buf), nil
}

// signalListenerReady tells the guest that the data listener is ready.
func signalListenerReady(vmGUID hvsock.GUID) error {
	conn, err := getVsockConnection(vmGUID, vsockHandshakePort)
	if err != nil {
		return err
	}
	defer conn.Close()
	_, err = conn.Write([]byte(readySignal))
	return err
}

Fix: Make the handshake sequential and add I/O deadlines. This eliminates the goroutine leak, the panic risk, and the missing-deadline problem in one change:

func attemptHandshake(ctx context.Context, logger logr.Logger, vmGUID hvsock.GUID, found chan<- hvsock.GUID) {
	for {
		if ctx.Err() != nil {
			return
		}
		conn, err := getVsockConnection(vmGUID, vsockHandshakePort)
		if err != nil {
			select {
			case <-ctx.Done():
				return
			case <-time.After(time.Second):
			}
			continue
		}
		conn.SetReadDeadline(time.Now().Add(5 * time.Second))
		sig, err := readSignature(conn)
		conn.Close()
		if err != nil {
			logger.V(1).Info("Handshake read failed", "vmGUID", vmGUID.String(), "error", err)
			continue
		}
		if sig == signaturePhrase {
			logger.V(1).Info("Signature matched", "vmGUID", vmGUID.String())
			found <- vmGUID
			return
		}
		// Wrong VM — stop probing this GUID.
		return
	}
}

Also add a write deadline in signalListenerReady:

 func signalListenerReady(vmGUID hvsock.GUID) error {
 	conn, err := getVsockConnection(vmGUID, vsockHandshakePort)
 	if err != nil {
 		return err
 	}
 	defer conn.Close()
+	conn.SetWriteDeadline(time.Now().Add(5 * time.Second))
 	_, err = conn.Write([]byte(readySignal))
 	return err
 }
I4. Vsock listener leaked on early return in runHostSwitch Claude

After vsockHandshake returns a listener at line 96, two error paths return without closing it: virtualnetwork.New failure (line 104) and vn.Listen failure (line 135). The listener holds an AF_VSOCK socket bound to a VM GUID + service ID. Each leaked listener blocks that service ID until the socket is garbage collected.

func (r *LimaVMReconciler) runHostSwitch(ctx context.Context, logger logr.Logger, done chan struct{}) {
	defer close(done)

	subnet, err := validateSubnet(defaultSubnet)
	if err != nil {
		logger.Error(err, "Invalid subnet configuration")
		return
	}

	ln, err := vsockHandshake(ctx, logger)
	if err != nil {
		logger.Error(err, "Vsock handshake failed")
		return
	}

	cfg := newVirtualNetworkConfig(*subnet)
	vn, err := virtualnetwork.New(&cfg)
	if err != nil {
		logger.Error(err, "Failed to create virtual network")
		return
	}

	g, ctx := errgroup.WithContext(ctx)

	// Accept vsock connections and feed them into the virtual network.
	g.Go(func() error {
		for {
			conn, err := ln.Accept()
			if err != nil {
				// Listener closed (context cancelled).
				return nil //nolint:nilerr // Expected on shutdown.
			}
			if err := vn.AcceptStdio(ctx, conn); err != nil {
				logger.Error(err, "Failed to accept connection into virtual network")
			} else {
				logger.Info("Accepted vsock connection")
			}
		}
	})

	// Close the listener when the context is cancelled.
	g.Go(func() error {
		<-ctx.Done()
		return ln.Close()
	})

	// Start the port-forwarding API server inside the virtual network.
	apiAddr := fmt.Sprintf("%s:80", cfg.GatewayIP)
	vnLn, err := vn.Listen("tcp", apiAddr)
	if err != nil {
		logger.Error(err, "Failed to listen on API address", "addr", apiAddr)
		return
	}

Fix: Add explicit ln.Close() to each early-return path after line 96:

 	cfg := newVirtualNetworkConfig(*subnet)
 	vn, err := virtualnetwork.New(&cfg)
 	if err != nil {
+		ln.Close()
 		logger.Error(err, "Failed to create virtual network")
 		return
 	}
 ...
 	vnLn, err := vn.Listen("tcp", apiAddr)
 	if err != nil {
+		ln.Close()
 		logger.Error(err, "Failed to listen on API address", "addr", apiAddr)
 		return
 	}

Suggestions

S1. HTTP Serve returns a non-ErrServerClosed error on clean shutdown Claude Gemini

On shutdown, closing vnLn causes s.Serve to return a network error (not http.ErrServerClosed, which only Server.Shutdown/Server.Close produce). This passes the error filter, causing g.Wait() to return an error and log "Host-switch exited with error" on every clean shutdown.

	g.Go(func() error {
		<-ctx.Done()
		return vnLn.Close()
	})
	g.Go(func() error {
		s := &http.Server{
			Handler:      mux,
			ReadTimeout:  10 * time.Second,
			WriteTimeout: 10 * time.Second,
		}
		err := s.Serve(vnLn)
		if err != nil && !errors.Is(err, http.ErrServerClosed) {
			return err
		}
		return nil
	})

Fix: Also filter net.ErrClosed:

 		err := s.Serve(vnLn)
-		if err != nil && !errors.Is(err, http.ErrServerClosed) {
+		if err != nil && !errors.Is(err, http.ErrServerClosed) && !errors.Is(err, net.ErrClosed) {
 			return err
 		}
S2. No unit tests for platform-independent helpers Claude Codex Gemini

validateSubnet is pure logic in hostswitch.go (no build tag) and trivially testable on any platform. newVirtualNetworkConfig maps the subnet struct into gvisor-tap-vsock configuration and is also portable.

Fix: Add hostswitch_test.go (no build tag) with table-driven tests for validateSubnet covering the default subnet, a custom subnet, an invalid CIDR, and an IPv6 CIDR. Optionally test newVirtualNetworkConfig to verify DNS zones, DHCP leases, and NAT mappings match the subnet.


Design Observations

Concerns

(future) startHostSwitch is fire-and-forget: it spawns a goroutine and returns immediately. If the goroutine fails (handshake timeout, virtual network error), the reconciler has no signal. The guest's network-setup.service blocks, the hostagent eventually times out, and the watcher reports phaseStopped — but the root cause is buried in logs. A future improvement could have the goroutine report failure back to the reconciler (via a status field or reconcile enqueue) so networking failures surface in the LimaVM conditions. Codex Claude

Strengths


Testing Assessment

  1. No integration tests for host-switch lifecycle — The host-switch requires a running WSL2 VM, which limits integration testing. However, the lifecycle integration (start/stop ordering, error path cleanup) could be tested by verifying state map operations and cancel function calls with mock interfaces.
  2. No unit tests for validateSubnet — Pure function, platform-independent, trivially testable (see S2).
  3. No unit tests for newVirtualNetworkConfig — Verifying DNS zones, DHCP leases, and NAT mappings catches configuration regressions.
  4. No test for the host-switch dying after startup succeeds — The most important recovery path introduced by this PR is unverified.

Documentation Assessment

The networking.md design doc is comprehensive and well-structured. The api_lima.md cross-reference is appropriate. Once I1 is fixed, the doc should describe what happens when the host-switch exits unexpectedly — the crash recovery section covers the hostagent but not the host-switch.


Agent Performance Retro

Claude

Claude Opus 4.6 read all 9 changed files plus the hostagent watcher, then ran 7 git blame commands to verify which changes were introduced by this PR. It identified the most architecturally significant finding (I1: missing stopHostSwitch in handleRestartNeeded) by systematically tracing every shutdown path and confirming the gap with git blame. It also found the vsock listener leak (I4) that both other agents missed. The HTTP shutdown issue was correctly scoped as a suggestion.

Claude did not flag the attemptHandshake goroutine leak or the panic risk. Its Coverage Summary accounts for all files, but the handshake concurrency analysis was shallow — it praised the parallel design as a strength without examining the goroutine lifecycle.

Tool call highlights: 23 calls. 10 Read, 8 Bash (7 git blame, 1 git diff), 5 Grep. Heavy git blame usage explains the accurate regression vs. gap classifications.

Codex

Codex GPT 5.4 made 52 tool calls — the highest count — reading all changed files, test files, the hostagent watcher, and even upstream dependency source code (gvisor-tap-vsock, virtsock). It ran go test to check for existing test coverage and verified the WSL2 image integration path. The finding about host-switch failures being invisible to the reconciler (folded into the design observation) demonstrates good architectural thinking.

Codex identified the goroutine overwrite (I2) and unbounded probing (I3) but did not flag the panic risk on the closed channel or the missing handleRestartNeeded path (I1). It also missed the vsock listener leak (I4).

Tool call highlights: 52 calls. 51 exec_command, 1 write_stdin. Shell commands: 21 sed -n (file reads), 11 nl (numbered listings), 9 rg (search), 5 git blame. Codex used go doc to inspect upstream APIs and go test to verify test state — more thorough upstream analysis than the other agents.

Gemini

Gemini 3.1 Pro made 13 tool calls — the fewest. It focused on the hostswitch_windows.go file (3 reads) and attempted a live build (make build). It identified the closed-channel panic (part of I3) that Codex missed, provided a complete sequential-handshake fix, and flagged the missing I/O deadlines. The HTTP server shutdown finding was initially rated important but is correctly a suggestion.

Gemini raised a concern about context lifetime decoupling — that the reconcile context might be per-reconcile in controller-runtime v0.15+. This is invalid for v0.23.3, where the reconcile context is the manager's context (the existing startWatcher uses the same pattern). Gemini did not run git blame, so it could not distinguish regressions from pre-existing issues.

Tool call highlights: 13 calls. 7 grep_search, 4 read_file, 2 run_shell_command (1 attempted live test, 1 make build). No git blame usage — consistent with the known limitation.

Summary

Claude Opus 4.6 Codex GPT 5.4 Gemini 3.1 Pro
Duration340s228s257s
Findings0C 2I 2S0C 2I 1S0C 2I 1S
Tool calls23 (Read 10, Bash 8, Grep 5)52 (exec_command 51)13 (grep_search 7, read_file 4)
Design observations1 concern, 4 strengths1 concern, 1 strength1 concern (invalid)
False positives001 (context lifetime)
Unique insightsI1 (handleRestartNeeded), I4 (listener leak)upstream API inspectionPanic on closed channel, I/O deadlines
Files reviewed9/99/97/9
Coverage misses002 (api_lima.md, lima-images-wsl2.yaml marked trivial)

Reconciliation: Gemini C1 (critical, goroutine leak + panic): downgraded to important I3 — the panic requires a specific timing window (detached goroutine must complete dial before context cancellation but send after channel close) that is unlikely under normal operation. Gemini C2 (critical, missing I/O deadlines): merged into I3 fix recommendation as part of the sequential handshake — the deadlines are needed for the fix to work correctly, not a standalone critical issue. Gemini I2 (important, HTTP shutdown): downgraded to suggestion S1 — spurious log does not affect functionality. Codex I1 (important, host-switch failure invisible): reclassified as design observation — the system self-heals through the watcher; visibility is a future improvement.

Claude provided the most thorough lifecycle analysis (I1, I4), Codex the deepest upstream investigation, and Gemini the strongest concurrency analysis (panic risk, I/O deadlines).


Review Process Notes

Repo context updates