Deep Review: 20260519-103443-pr-355

Date2026-05-19 10:45
Reporancher-sandbox/rancher-desktop-daemon
Round1
Author@mook-as
PR#355 — K3s: Use a dynamic port
Branchmook-as:dynamic-k3s-port
Commits2db5c92 K3s: Use a dynamic port
ReviewersClaude Opus 4.7 (effort: xhigh), Codex GPT 5.5 (effort: xhigh), Gemini 3.1 Pro (effort: default)
VerdictMerge with fixes — one Important correctness gap (non-SSA status update violates this file's own concurrent-writer contract) plus a port-staleness window that breaches ResolvePort's documented contract. Test coverage and several smaller polish items follow.
Wall-clock time14 min 20 s


Critical Issues

None.


Important Issues

I1. Non-SSA Status().Update on App violates the concurrent-writer contract Claude Codex Gemini
	limaVMErr := r.Get(ctx, client.ObjectKey{Name: limaVMName, Namespace: namespace}, limaVM)
	if limaVMErr != nil && !apierrors.IsNotFound(limaVMErr) {
		return ctrl.Result{}, limaVMErr
	}

	// If the Kubernetes port is not set, set it first.
	if app.Spec.Kubernetes.Enabled && app.Status.KubernetesPort == 0 {
		app.Status.KubernetesPort, err = controllers.ResolvePort(ctx, 7441+instance.Index())
		if err != nil {
			return ctrl.Result{}, fmt.Errorf("failed to resolve Kubernetes port: %w", err)
		}
		if err := r.Status().Update(ctx, &app); err != nil {
			return ctrl.Result{}, fmt.Errorf("failed to update App status: %w", err)
		}
		// The status update causes a requeue.
		return ctrl.Result{}, nil
	}

	if apierrors.IsNotFound(limaVMErr) {
		template, err := applySpecToTemplate(r.LimaTemplateData, app.Spec, app.Status)
		if err != nil {
			return ctrl.Result{}, fmt.Errorf("failed to apply spec to template: %w", err)

App.Status has multiple concurrent writers — AppReconciler mirrors LimaVM conditions and EngineReconciler.setEngineCondition writes ContainerEngineReady on the same object. The file's own later condition update at line 371 spells this out and uses retry.RetryOnConflict with a re-Get for exactly this reason. The new code reuses the app value fetched at line 136 and calls a bare r.Status().Update. When EngineReconciler lands a write between those two points, the API server returns 409, the reconcile errors out, controller-runtime requeues, and the next pass re-runs ResolvePort from scratch — opening and closing a fresh listener each time. The recovery path works, but the port can churn under contention, and the pattern is exactly what the rest of the file is structured to avoid.

// The app controller is a cluster-scoped singleton - only one instance named 'app' is allowed.
func (r *AppReconciler) Reconcile(ctx context.Context, _ ctrl.Request) (ctrl.Result, error) {
	log := logf.FromContext(ctx)

	var app v1alpha1.App
	if err := r.Get(ctx, client.ObjectKey{Name: appName}, &app); err != nil {
		if !apierrors.IsNotFound(err) {
			log.Error(err, "Unable to fetch App singleton")
		}
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}
	// app's resourceVersion from the initial Get can be stale.
	// retry.RetryOnConflict + re-Get matches
	// EngineReconciler.setEngineCondition; without it, concurrent
	// writers 409-loop through controller-runtime requeues.
	engineEnabled := r.engineEnabled(ctx)
	if err := retry.RetryOnConflict(retry.DefaultRetry, func() error {
		latest := &v1alpha1.App{}
		if err := r.Get(ctx, client.ObjectKeyFromObject(&app), latest); err != nil {
			return err
		}
		changed := false

Fix: mirror the existing pattern.

 // If the Kubernetes port is not set, set it first.
 if app.Spec.Kubernetes.Enabled && app.Status.KubernetesPort == 0 {
-    app.Status.KubernetesPort, err = controllers.ResolvePort(ctx, 7441+instance.Index())
+    port, err := controllers.ResolvePort(ctx, 7441+instance.Index())
     if err != nil {
         return ctrl.Result{}, fmt.Errorf("failed to resolve Kubernetes port: %w", err)
     }
-    if err := r.Status().Update(ctx, &app); err != nil {
+    if err := retry.RetryOnConflict(retry.DefaultRetry, func() error {
+        latest := &v1alpha1.App{}
+        if err := r.Get(ctx, client.ObjectKey{Name: appName}, latest); err != nil {
+            return err
+        }
+        if latest.Status.KubernetesPort != 0 {
+            return nil
+        }
+        latest.Status.KubernetesPort = port
+        return r.Status().Update(ctx, latest)
+    }); err != nil {
         return ctrl.Result{}, fmt.Errorf("failed to update App status: %w", err)
     }
     // The status update causes a requeue.
     return ctrl.Result{}, nil
 }
I2. Port probe runs minutes before the actual bind, well past ResolvePort's documented contract Codex Gemini

ResolvePort opens a 127.0.0.1:desired listener, closes it immediately, and returns the port. Its own godoc warns the caller to "call this immediately before the port is needed to minimize the window between releasing and rebinding" (pkg/service/controllers/ports.go:14-18). Here, AppReconciler calls it on first reconcile, persists the result to Status, requeues, then on a later reconcile creates the input ConfigMap, creates the LimaVM, and only minutes later — once Lima provisions the VM and sets up port forwarding — does anything bind the host port. Between the close and the bind, any other process on the host can take the port. Lima's portForwards block (pkg/controllers/app/app/lima-template.yaml:13-21) uses hostPortRange: [0, 0] to mirror the guest port, so a stolen port causes Lima to land on a different host port than the value in App.Status.KubernetesPort. The kubeconfig probe then writes a config pointing at a port no one is serving.

	"fmt"
	"net"
	"strconv"
)

// ResolvePort binds a TCP port on localhost to confirm availability. If the
// desired port is occupied, the OS assigns a random port instead. The listener
// is closed before returning so the caller can rebind it. Call this immediately
// before the port is needed to minimize the window between releasing and
// rebinding.
func ResolvePort(ctx context.Context, desired int) (int, error) {
	lc := net.ListenConfig{}
	ln, err := lc.Listen(ctx, "tcp", "127.0.0.1:"+strconv.Itoa(desired))
	if err != nil {
		ln, err = lc.Listen(ctx, "tcp", "127.0.0.1:0")
firmware:
  legacyBIOS: false
containerd:
  system: false
  user: false
portForwards:
- guestIPMustBeZero: true
  guestPortRange:
  - 1
  - 65535
  hostIP: 0.0.0.0
  hostPortRange:
  - 0
  - 0
- guestPortRange:
  - 0
  - 0
  guestSocket: /var/run/docker.sock
  hostPortRange:

Fix: tighten the resolve/bind window — move the resolve into the component that actually binds, or have that component report the actual host port back to status and treat App.Status.KubernetesPort as observed rather than intended. Short of that, document the contract breach on the field and on the resolve call so a future reader knows the stored port is advisory.

I3. No tests for the new port-resolution branch or applySpecToTemplate parameter Claude

The existing tests cover computeSettledCondition and engineEnabled only. The PR adds:

	limaVMErr := r.Get(ctx, client.ObjectKey{Name: limaVMName, Namespace: namespace}, limaVM)
	if limaVMErr != nil && !apierrors.IsNotFound(limaVMErr) {
		return ctrl.Result{}, limaVMErr
	}

	// If the Kubernetes port is not set, set it first.
	if app.Spec.Kubernetes.Enabled && app.Status.KubernetesPort == 0 {
		app.Status.KubernetesPort, err = controllers.ResolvePort(ctx, 7441+instance.Index())
		if err != nil {
			return ctrl.Result{}, fmt.Errorf("failed to resolve Kubernetes port: %w", err)
		}
		if err := r.Status().Update(ctx, &app); err != nil {
			return ctrl.Result{}, fmt.Errorf("failed to update App status: %w", err)
		}
		// The status update causes a requeue.
		return ctrl.Result{}, nil
	}

	if apierrors.IsNotFound(limaVMErr) {
		template, err := applySpecToTemplate(r.LimaTemplateData, app.Spec, app.Status)
		if err != nil {
			return ctrl.Result{}, fmt.Errorf("failed to apply spec to template: %w", err)

			r := &AppReconciler{Discovery: tt.discovery}
			assert.Equal(t, r.engineEnabled(t.Context()), tt.want)
		})
	}
}

Neither has unit coverage, and no BATS test asserts that app.status.kubernetesPort settles on a sensible value after first reconcile or that the rendered template contains the expected port. The function applySpecToTemplate is pure and easy to cover with a table-driven test on golden output; the reconcile branch is straightforward to cover with the existing fake-client setup.

Fix: add a table-driven unit test for applySpecToTemplate asserting KUBERNETES_PORT: <n> for representative inputs (Kubernetes enabled with port set; Kubernetes disabled with port 0), plus a reconciler test that drives the first-reconcile branch and asserts Status.KubernetesPort becomes non-zero. A BATS check that reads app.status.kubernetesPort and curls the published port for a TLS handshake would close the integration gap.


Suggestions

S1. Pass kubernetesPort, not the whole AppStatus, to applySpecToTemplate Claude
		return true
	}
	return slices.Contains(enabled, v1alpha1.EngineControllerName)
}

func applySpecToTemplate(baseTemplate string, spec v1alpha1.AppSpec, status v1alpha1.AppStatus) (string, error) {
	hostHome, err := os.UserHomeDir()
	if err != nil {
		return "", fmt.Errorf("failed to get host home directory: %w", err)
	}
	return strings.Join([]string{
		baseTemplate,
		"param:",
		fmt.Sprintf("  CONTAINER_ENGINE: %s", spec.ContainerEngine.Name),
		fmt.Sprintf("  HOST_DOCKER_SOCKET: %q", instance.DockerSocket()),
		fmt.Sprintf("  HOST_HOME_GUEST: %q", toLinuxPath(hostHome)),
		fmt.Sprintf("  KUBERNETES_ENABLED: %v", spec.Kubernetes.Enabled),
		fmt.Sprintf("  KUBERNETES_VERSION: %s", spec.Kubernetes.Version),
		fmt.Sprintf("  KUBERNETES_PORT: %d", status.KubernetesPort),
		"",
	}, "\n"), nil
}

// toLinuxPath converts a host path to a Linux-accessible path inside a Lima VM.
// On Windows, os.UserHomeDir() returns a Windows path (e.g. C:\Users\foo).
// Inside a WSL2 Lima VM the Windows filesystem is mounted at /mnt/<drive>/...,
// so we convert it to WSL2 supported path. On other platforms the path is returned unchanged.

AppStatus carries Conditions []metav1.Condition plus the new KubernetesPort int. The function uses only the latter. The wider signature invites a future reader to wonder whether conditions are intentionally baked into the template, and forces the signature to drift every time an unrelated status field is added.

-func applySpecToTemplate(baseTemplate string, spec v1alpha1.AppSpec, status v1alpha1.AppStatus) (string, error) {
+func applySpecToTemplate(baseTemplate string, spec v1alpha1.AppSpec, kubernetesPort int) (string, error) {
     ...
-    fmt.Sprintf("  KUBERNETES_PORT: %d", status.KubernetesPort),
+    fmt.Sprintf("  KUBERNETES_PORT: %d", kubernetesPort),

Update both call sites (lines 244 and 328) to pass app.Status.KubernetesPort.

		// The status update causes a requeue.
		return ctrl.Result{}, nil
	}

	if apierrors.IsNotFound(limaVMErr) {
		template, err := applySpecToTemplate(r.LimaTemplateData, app.Spec, app.Status)
		if err != nil {
			return ctrl.Result{}, fmt.Errorf("failed to apply spec to template: %w", err)
		}
		inputCM := &corev1.ConfigMap{}
		cmErr := r.Get(ctx, client.ObjectKey{Name: inputConfigMapName, Namespace: namespace}, inputCM)
S2. Stored port is sticky across kubernetes.enabled toggle cycles Claude Codex Gemini
	if limaVMErr != nil && !apierrors.IsNotFound(limaVMErr) {
		return ctrl.Result{}, limaVMErr
	}

	// If the Kubernetes port is not set, set it first.
	if app.Spec.Kubernetes.Enabled && app.Status.KubernetesPort == 0 {
		app.Status.KubernetesPort, err = controllers.ResolvePort(ctx, 7441+instance.Index())
		if err != nil {
			return ctrl.Result{}, fmt.Errorf("failed to resolve Kubernetes port: %w", err)
		}
		if err := r.Status().Update(ctx, &app); err != nil {

The guard re-resolves only when KubernetesPort == 0. A user who enables Kubernetes (port resolves to 7443), disables it, and re-enables it minutes later re-uses the stored 7443 with no check that the host port is still free. Combined with the I2 staleness window, this lets the stored port drift arbitrarily far from reality. Either zero KubernetesPort on the enabled true→false transition so the next enable resolves fresh, or document the stickiness in the field's godoc so a future kubeconfig-merge feature can surface the discrepancy when Lima ends up forwarding to a different host port.

S3. # $PARAM_KUBERNETES_PORT validation-marker comment is opaque and fragile Claude
provision:
- mode: system
  script: |
    #!/bin/bash

    # All param need to be mentioned for Lima validation.
    # $PARAM_KUBERNETES_PORT

    set -o errexit -o nounset -o pipefail -o xtrace

    systemctl stop rancher-desktop.target

The existing line is already terse; combined with a comment that looks like a stub variable reference, a future reader has to guess what # $PARAM_KUBERNETES_PORT is doing there. The pattern is also brittle: anyone adding a new param has to remember to add a matching marker comment, even though the for p in "${!PARAM_@}" loop below would otherwise pick it up automatically.

-    # All param need to be mentioned for Lima validation.
-    # $PARAM_KUBERNETES_PORT
+    # Lima validates that every declared `param:` entry is referenced somewhere
+    # in the template. The loop below uses bash indirection that Lima's static
+    # scanner cannot see, so list any params consumed only via ${!PARAM_@} here:
+    #   $PARAM_KUBERNETES_PORT
S4. Design docs still describe the API as hardcoded 6443 Claude Codex

docs/design/networking.md:185 reads:

> - Kubernetes port forwarding: Pre-forward 127.0.0.1:6443 to 192.168.127.2:6443 so kubectl on the host reaches the K8s API inside the VM.

The future-work item still names 6443, contradicting the PR. docs/design/api_app.md's App status example (lines 66-85) shows only conditions — no kubernetesPort, even though the field is now part of the documented API surface.


    Note over HS: Match found for GUID-A
```

The registry is rescanned every second because `startHostSwitch` runs before the hostagent boots the WSL2 VM. On a fresh system where no other WSL2 distro is running, the utility VM may not yet appear in the registry when the first scan runs.

The signature phrase is `"github.com/rancher-sandbox/rancher-desktop/src/go/networking"`. This is a fixed protocol constant shared between host-switch and the guest's `network-setup` binary. Because the signature is product-wide rather than per-instance, this discovery assumes only one opensuse WSL2 instance runs at a time.

### Phase 2: Data Channel (port 6656)

After identifying the VM, the host-switch creates a listener and signals readiness.

```mermaid
sequenceDiagram
    participant HS as Host-Switch
    participant NS as network-setup (guest)
    participant VS as vm-switch (guest)

    HS->>HS: Listen on vsock port 6656
    HS->>NS: Dial port 6669, send "READY"
    NS->>NS: Connect to vsock port 6656
    NS->>VS: Spawn vm-switch (pass vsock FD)
    VS->>VS: Create TAP device (eth0)
    VS->>HS: Ethernet frames over vsock
    HS->>VS: Ethernet frames over vsock
```

Ethernet frames use a simple framing protocol: a 2-byte little-endian size prefix followed by the raw frame payload.

## Virtual Network Services

Update the networking bullet to refer to App.Status.KubernetesPort (or 7441+index) and add kubernetesPort: <n> plus a one-line description to the api_app.md status example.


Design Observations

Concerns

Strengths


Testing Assessment

Untested scenarios, highest risk first:

  1. First reconcile with Spec.Kubernetes.Enabled=true writes Status.KubernetesPort and requeues without creating the LimaVM (I3, the new branch at lines 230-241).
  2. applySpecToTemplate honours status.KubernetesPort in the emitted template (pure function, no current test).
  3. Concurrent-writer race: EngineReconciler.setEngineCondition updates a condition between the Get at line 136 and the Status().Update at line 236; today's code returns an error and is requeued (I1).
  4. applySpecToTemplate when Spec.Kubernetes.Enabled=false and Status.KubernetesPort=0 — verify the template still parses and the embedded KUBERNETES_PORT: 0 does not break downstream consumers.
  5. kubernetes.enabled toggle cycles — disable then re-enable does not re-resolve the port (S2).
  6. End-to-end: rendered Lima template, generated kubeconfig, and App.Status all agree on the selected fallback port when 7441+index is occupied.

go test ./pkg/controllers/app/app/controllers passes on the PR SHA.


Documentation Assessment


Commit Structure

The PR is a single commit (2db5c92 K3s: Use a dynamic port); the diff is self-contained. No fixup history, no scope creep.


Acknowledged Limitations

PR review comment (resolved):


Unresolved Feedback


Agent Performance Retro

Claude

Strongest performance this round. Caught every Important finding raised by any agent, contributed the only test-coverage finding (I3), and added three of the four Suggestions on its own (S1 function-signature cohesion, S3 cryptic marker comment, S4 docs reconciliation). Framed the host-port-reservation race as a design observation rather than an Important — defensible, since the race predates this PR — and complemented the framing with the explicit KUBERNETES_PORT=0-when-disabled observation no other agent surfaced. Worth highlighting: Claude's I1 fix sketch is a drop-in replacement that another reviewer could land verbatim, including an idempotency guard (if latest.Status.KubernetesPort != 0 { return nil }) the other agents' fixes omit.

Codex

Crisp and accurate. Two Important findings (I1 status-update pattern, I2 port-staleness window) both verified against the code, with citations into the engine reconciler's own RetryOnConflict block and the ResolvePort godoc; the I1 fix cites the exact line of the comparable retry block. Codex's I2 uniquely bundled the disable/re-enable staleness angle (Claude split it out as S2) — the bundled framing is reasonable but the split is clearer for tracking fixes. No false positives, no severity downgrades. Codex also ran go test ./pkg/controllers/app/app/controllers directly to confirm the type change compiles.

Gemini

Two false positives ate most of the signal. C1 claimed K3s never sees the dynamic port because /etc/sysconfig/rancher-desktop is not sourced — wrong: the rancher-desktop-opensuse image's k3s.service.d/{environment,command-override}.conf source it and use $KUBERNETES_PORT as --https-listen-port. The image lives in a sibling repo that Gemini did not have access to, but the I-just-checked-the-image-side gap was missed, and the framing escalated a "I cannot see the image side" gap to a Critical regression. I1 (%q breaking systemd parsing as the root cause of Windows BATS failure) is also wrong: the current PR's Windows BATS jobs all pass, and the only env-file values actually consumed by systemd services (CONTAINER_ENGINE, KUBERNETES_ENABLED, KUBERNETES_PORT) have no characters %q would quote, so there is no observable bug. C2 (non-SSA status update) and I2 (port-probe race) reduced to the same Important findings as the other two agents — useful corroboration but no unique insight. As expected from the repo context, Gemini also did not run git blame.

Summary

Claude Opus 4.7Codex GPT 5.5Gemini 3.1 Pro
Duration8m 48s7m 12s
Findings2I 4S2I 1S1I
Tool calls41 (Grep 20, Read 12, Bash 8)45 (shell 45)
Design observations211
False positives002
Unique insights300
Files reviewed444
Coverage misses000
Totals2I 4S2I 1S1I
Downgraded001 (I→S)
Dropped002

Claude was the most valuable reviewer this round — every finding it raised survived consolidation, and three of the seven consolidated findings (I3, S1, S3) were unique to it. Codex was a clean, accurate second reviewer that corroborated Claude's correctness work and contributed the precise EngineReconciler citation for I1's fix. Gemini's contribution after dropping its two false positives is best summarised as "corroborated what the other two already had"; the false-positive cost (one Critical, one Important) was high relative to the unique signal added.

Reconciliation


Review Process Notes

Repo context updates