Deep Review: 20260512-133636-pr-350

Date2026-05-12 13:36
Reporancher-sandbox/rancher-desktop-daemon
Round1 (of target)
Author@Nino-K
PR#350 — Wire up docker socket windows
Commits0322994 Fixes test 104 stopping VM removes all mirror resources
e917657 Wire Docker socket forwarding for Windows
b5ab872 Bump openSUSE to v0.2.3
ReviewersClaude Opus 4.7 (effort: xhigh), Codex GPT 5.5 (effort: xhigh), Gemini 3.1 Pro (effort: default)
VerdictMerge with fixes — bound waitMirrorResourcesGone to prevent reconciler wedge; register engine controller in standalone containers-controller binary.
Wall-clock time16 min 51 s


Executive Summary

The PR wires Windows Docker socket forwarding by introducing instance.DockerEndpoint() (npipe:////./pipe/docker_engine on Windows, unix://… on Unix), removes the Windows early-return guards in the engine controller and Docker-context management, accepts the npipe scheme in probeNamedDockerContext, and re-enables engine-docker.bats on Windows with cygpath/is_windows shims. The third commit fixes test 104 by tightening computeSettledCondition to require a terminal engine reason (Stopped/NotApplicable) at the current generation before declaring Settled=True, switching cleanupMirrorResources reads to apiReader, and polling until the API server reports the mirrors gone. The condition logic is correctly scoped and well-tested. The remaining concerns are: (1) the new waitMirrorResourcesGone polls an unbounded loop using a reader that bypasses the very cache it claims to wait for, and (2) the standalone containers-controller binary still does not blank-import the engine package — a pre-existing gap now made visible by Windows support.


Critical Issues

None.


Important Issues

I1. waitMirrorResourcesGone is unbounded and reads etcd, not the watch cache Claude Gemini
}

// waitMirrorResourcesGone polls until all four mirror resource types
// are empty in the API server's watch cache. It returns an error if
// the context is cancelled before the list drains.
func (r *EngineReconciler) waitMirrorResourcesGone(ctx context.Context) error {
	log := logf.FromContext(ctx)
	for {
		remaining, err := r.countMirrorResources(ctx)
		if err != nil {
			return fmt.Errorf("error verifying mirror resources gone: %w", err)
		}
		if remaining == 0 {
			return nil
		}
		log.V(1).Info("Waiting for mirror resources to disappear from watch cache",
			"remaining", remaining)
		select {
		case <-ctx.Done():
			return fmt.Errorf("timed out waiting for %d mirror resources to be deleted: %w",
				remaining, ctx.Err())
		case <-time.After(200 * time.Millisecond):
		}
	}
}

// countMirrorResources returns the total number of mirror resources
// still visible in the API server for all four types.
func (r *EngineReconciler) countMirrorResources(ctx context.Context) (int, error) {
	lists := []client.ObjectList{
		&containersv1alpha1.ContainerList{},
		&containersv1alpha1.VolumeList{},
		&containersv1alpha1.ImageList{},
		&containersv1alpha1.ContainerNamespaceList{},
	}
	var total int
	for _, list := range lists {
		if err := r.apiReader.List(ctx, list, client.InNamespace(r.apiNamespace)); err != nil {
			return 0, err
		}
		items, err := meta.ExtractList(list)
		if err != nil {
			return 0, err
		}
		total += len(items)
	}
	return total, nil
}

// deleteAllOfType lists and removes every resource of the given kind
// in r.apiNamespace. Engine is authoritative for Container, Image, and
// Volume in the App's namespace: this loop deletes every object, not
// just engine-owned mirrors. Coexistence with another writer to these

Two problems compound. First, pkg/service/controllers/manager.go:143-153 sets no ReconciliationTimeout, so the ctx passed in is the manager's long-lived stop context. The loop exits only when every mirror disappears or the manager shuts down. A Container/Image/Volume created between stopWatcher() and the snapshot list in deleteAllOfType, a user-added finalizer that is not mirrorFinalizer (only that one is stripped at engine_reconciler.go:593), or a storage hiccup leaves an object visible forever. Because App is a cluster-scoped singleton with at most one in-flight Reconcile, a stuck loop blocks every subsequent state transition — rdd set running=true afterwards never runs.

	if err != nil {
		return fmt.Errorf("failed to resolve passthrough port: %w", err)
	}

	// Create the shared controller-runtime manager
	managerOptions := ctrl.Options{
		Scheme: managerScheme,
		Metrics: server.Options{
			BindAddress: "127.0.0.1:" + strconv.Itoa(scm.metricsPort),
		},
		HealthProbeBindAddress: "127.0.0.1:" + strconv.Itoa(scm.healthPort),
		LeaderElection:         false, // RDD controllers are single-instance
		// Limit graceful shutdown time to avoid blocking external controller exit.
		// Default is 30s which is too long when control plane disappears.
		GracefulShutdownTimeout: ptr.To(10 * time.Second),
	}

	// Only configure webhook server if controllers require it
	if scm.hasWebhookControllers() {
		// Use instance TLS directory for webhook certificates (persistent storage)
		webhookCertDir := instance.TLSDir()
				if apierrors.IsNotFound(err) {
					return nil
				}
				return err
			}
			if !controllerutil.RemoveFinalizer(latest, mirrorFinalizer) {
				return nil
			}
			return r.Update(ctx, latest)
		})
		if retryErr != nil {

Second, the loop's stated purpose (the function comment) is to "Block until the API server's watch cache reflects the deletions". apiReader is the manager's GetAPIReader() — it bypasses every cache and issues a quorum etcd read (no resourceVersion=0). DELETE commits to etcd synchronously, so apiReader.List returns 0 immediately. The user-facing rdd ctl get reads with resourceVersion=0 from the API server's watch cache, which lags etcd by the goroutine that processes the watch event. The poll exits before the cache it claims to drain is drained; the test passes today because the other fixes in this commit (the ObservedGeneration guard on alreadyClean, plus reads via apiReader in deleteAllOfType) close the actual race.

Fix: bound the wait with context.WithTimeout, and switch countMirrorResources to the controller-runtime cache (r.List) so the loop drains a cache downstream of the API server's watch cache.

 func (r *EngineReconciler) waitMirrorResourcesGone(ctx context.Context) error {
     log := logf.FromContext(ctx)
+    ctx, cancel := context.WithTimeout(ctx, 30*time.Second)
+    defer cancel()
     for {
 func (r *EngineReconciler) countMirrorResources(ctx context.Context) (int, error) {
     ...
     for _, list := range lists {
-        if err := r.apiReader.List(ctx, list, client.InNamespace(r.apiNamespace)); err != nil {
+        if err := r.List(ctx, list, client.InNamespace(r.apiNamespace)); err != nil {
             return 0, err
         }

Note: r.List reads the controller's informer cache, which is populated only for types the controller Watches. Container, Image, and Volume are watched; ContainerNamespace is not (see I2 below), so this fix should land alongside watching ContainerNamespace or accept the small risk that one ContainerNamespace slips by on the rare delete-during-stop window.

I2. Standalone containers-controller binary does not register the engine controller Codex
package main

import (
	"os"

	// Import app controller packages to trigger init() functions.
	_ "github.com/rancher-sandbox/rancher-desktop-daemon/pkg/controllers/containers"
	"github.com/rancher-sandbox/rancher-desktop-daemon/pkg/external"
)

func main() {
	os.Exit(external.RunControllers("containers"))
}
// pkg/controllers/containers/controller.go
import (
    // Import controllers to register them.
    _ "github.com/rancher-sandbox/rancher-desktop-daemon/pkg/controllers/containers/container"
    _ "github.com/rancher-sandbox/rancher-desktop-daemon/pkg/controllers/containers/containernamespace"
    _ "github.com/rancher-sandbox/rancher-desktop-daemon/pkg/controllers/containers/image"
    _ "github.com/rancher-sandbox/rancher-desktop-daemon/pkg/controllers/containers/volume"
)

The engine controller registers itself in pkg/controllers/app/engine/controller.go:25-26 via init() and declares the containers API group (pkg/controllers/app/engine/controller.go:59-60). The embedded service imports it at pkg/service/cmd/service.go:59, but the standalone containers-controller aggregator imports only the four container-CRD packages. --controllers help text at pkg/service/controllers/options.go:18 advertises containers,-engine, implying engine is part of the containers group, yet bin/containers-controller silently omits it. The repo context flags exactly this shape: "a missing import in either is a silent drop with no startup error."

	containersv1alpha1 "github.com/rancher-sandbox/rancher-desktop-daemon/pkg/apis/containers/v1alpha1"
	"github.com/rancher-sandbox/rancher-desktop-daemon/pkg/controllers/app/engine/controllers"
	"github.com/rancher-sandbox/rancher-desktop-daemon/pkg/controllers/base"
)

func init() {
	base.RegisterController(newController())
}

// apiGroup is "containers" because the reconciler watches
// Container, Image, and Volume from that group and needs
// their CRDs at startup. Grouping engine with its dependencies

func (c *controller) GetName() string {
	return appv1alpha1.EngineControllerName
}

func (c *controller) GetAPIGroup() string {
	return apiGroup
}

func (c *controller) GetCRDData() string {
	return ""
}

	"github.com/rancher-sandbox/rancher-desktop-daemon/pkg/cli/help"
	// Import controller packages to trigger init() functions for embedded mode.
	_ "github.com/rancher-sandbox/rancher-desktop-daemon/pkg/controllers/app/app"
	_ "github.com/rancher-sandbox/rancher-desktop-daemon/pkg/controllers/app/demo"
	_ "github.com/rancher-sandbox/rancher-desktop-daemon/pkg/controllers/app/engine"
	"github.com/rancher-sandbox/rancher-desktop-daemon/pkg/controllers/base"
	// Import built-in system controllers.
	_ "github.com/rancher-sandbox/rancher-desktop-daemon/pkg/controllers/builtin/namespace"
	// Import controller packages to trigger init() functions for embedded mode.
	_ "github.com/rancher-sandbox/rancher-desktop-daemon/pkg/controllers/containers"
// Shared between the embedded service and any standalone binary that
// surfaces the same flag so the two cannot drift.
const ControllersFlagUsage = "Controllers to enable. Use '*' for all, " +
	"or a comma-separated list of controller or API-group names. " +
	"Groups: rdd, app, containers, lima. " +
	"Prefix a name with '-' to exclude it, e.g. '*,-demo' or 'containers,-engine'."

// Options holds the controller configuration options.
type Options struct {
	Controllers string // Controller selection specification (--controllers flag)
}

The gap pre-dates this PR (engine has always lived under pkg/controllers/app/engine while declaring group containers), but Windows support makes engine reachable on every platform now and amplifies the consequence: a user who runs bin/containers-controller standalone on Windows gets Container/Image/Volume CRDs but no mirroring.

Fix: blank-import the engine package from the containers aggregator so every consumer of controllers/containers gets the full group.

 import (
     // Import controllers to register them.
+    _ "github.com/rancher-sandbox/rancher-desktop-daemon/pkg/controllers/app/engine"
     _ "github.com/rancher-sandbox/rancher-desktop-daemon/pkg/controllers/containers/container"
     _ "github.com/rancher-sandbox/rancher-desktop-daemon/pkg/controllers/containers/containernamespace"
     _ "github.com/rancher-sandbox/rancher-desktop-daemon/pkg/controllers/containers/image"
     _ "github.com/rancher-sandbox/rancher-desktop-daemon/pkg/controllers/containers/volume"
 )

Suggestions

S1. Stale "tcp/unix" wording in probeNamedDockerContext after adding npipe Claude Codex
}

// probeNamedDockerContext pings the Docker endpoint for the named context
// within dockerContextProbeTimeout and returns true if the endpoint is healthy.
//
// When we cannot determine whether the context is healthy — because the store
// is unreadable, the endpoint metadata is missing or malformed, the scheme is
// not tcp/unix (e.g. SSH), or the TLS config cannot be loaded — we return true
// (assume healthy). This conservatism prevents RDD from clobbering a context
// that is in the middle of being edited or migrated.
func probeNamedDockerContext(ctx context.Context, name string) bool {
	log := logf.FromContext(ctx).WithName("docker-context").WithValues("context", name)

	s, err := newContextStore()
	if err != nil {
		log.Error(err, "Cannot open context store; assuming context healthy")
		return true
	}
	md, err := s.GetMetadata(name)
	if err != nil {
		if errdefs.IsNotFound(err) {
			return false
		}
		log.Error(err, "Cannot read context metadata; assuming context healthy")
		return true
	}
	epMeta, err := docker.EndpointFromContext(md)
	if err != nil {
		log.Error(err, "Cannot decode docker endpoint; assuming context healthy")
		return true
	}
	scheme, _, _ := strings.Cut(epMeta.Host, "://")
	switch scheme {
	case "unix", "tcp", "npipe":
	default:
		log.Info("Non-tcp/unix endpoint scheme; assuming context healthy", "scheme", scheme)
		return true
	}
	ep, err := docker.WithTLSData(s, name, epMeta)
	if err != nil {
		log.Error(err, "Cannot load TLS data; assuming context healthy")
		return true

The PR adds "npipe" to the supported schemes at line 183 but leaves both the function docstring at line 157 and the structured-log message at line 185 saying "tcp/unix". Future readers will see the disagreement; the log message also misrepresents the actual decision when an ssh:// context appears.

Fix:

-// not tcp/unix (e.g. SSH), or the TLS config cannot be loaded — we return true
+// not tcp/unix/npipe (e.g. SSH), or the TLS config cannot be loaded — we return true
 ...
-        log.Info("Non-tcp/unix endpoint scheme; assuming context healthy", "scheme", scheme)
+        log.Info("Unprobed endpoint scheme; assuming context healthy", "scheme", scheme)
S2. BATS test uses $(cygpath -m …) command substitution where repo style prefers run -0 Claude
    #
    # On Windows, rdd.exe is a native binary: it interprets /tmp/... as a
    # path from the drive root (C:\tmp\...) rather than MSYS2's /tmp which
    # maps to a different location. cygpath -m produces a mixed-format path
    # (C:/msys64/...) that both native Windows processes and MSYS2 agree on.
    if is_windows; then
        DOCKER_CONFIG="$(cygpath -m "${BATS_FILE_TMPDIR}/docker-config")"
    else
        DOCKER_CONFIG="${BATS_FILE_TMPDIR}/docker-config"
    fi
    export DOCKER_CONFIG
    mkdir -p "${DOCKER_CONFIG}"

    # Deliberately skip setup_rdd_control_plane here: `rdd set` internally
    # calls ensureServiceRunning, which is exactly the CLI path we want to

$(cygpath -m …) swallows a non-zero exit and silently produces an empty string; mkdir -p "" then fails with a less actionable error than cygpath itself would. Repo BATS style prefers run -0 cmd …; value=${output} so command, status, and output all surface on failure.

Fix:

 if is_windows; then
-    DOCKER_CONFIG="$(cygpath -m "${BATS_FILE_TMPDIR}/docker-config")"
+    run -0 cygpath -m "${BATS_FILE_TMPDIR}/docker-config"
+    DOCKER_CONFIG=${output}
 else
     DOCKER_CONFIG="${BATS_FILE_TMPDIR}/docker-config"
 fi
S3. Engine reason strings duplicated across the app and engine packages without shared constants Claude
		//     "Connected/M+1" condition — written while the VM was still
		//     running on an earlier reconcile that saw the spec change but
		//     not yet the stopped VM — would incorrectly satisfy the wait,
		//     causing `rdd set running=false` to return before mirror
		//     resources (Containers, Images, Volumes) are deleted.
		engineSettled := engineCond != nil &&
			engineCond.ObservedGeneration >= app.Generation &&
			(engineCond.Reason == "Stopped" || engineCond.Reason == "NotApplicable")
		if !engineSettled {
			settled.Status = metav1.ConditionFalse
			settled.Reason = v1alpha1.AppSettledReasonEngineStale
			settled.Message = settledMessageEngineStale
		} else {
// engine_reconciler.go
terminalReason := "Stopped"
...
if running && !engineIsDocker {
    terminalReason = "NotApplicable"
...
if err := r.setEngineCondition(ctx, &app, metav1.ConditionTrue, "Connected", "Container engine synced"); err != nil {

computeSettledCondition now hard-depends on the engine reconciler's choice of reason strings. A typo or rename on one side silently breaks the wait. pkg/apis/app/v1alpha1/app_types.go already exports AppConditionContainerEngineReady as a constant; the reason values deserve the same treatment.

Fix: declare EngineReasonStopped, EngineReasonNotApplicable, EngineReasonConnected, EngineReasonConnectFailed in pkg/apis/app/v1alpha1/app_types.go and reference them from both packages.

S4. instance.DockerSocket() docstring claims "for this instance" but the Windows return is process-global Claude Gemini
// length constraints.
var LimaHome = sync.OnceValue(func() string {
	return filepath.Join(ShortDir(), "lima")
})

// DockerSocket returns the path to the Docker socket for this instance
// (e.g., ~/.rd2/docker.sock). This is the host-side socket that Lima
// port-forwards from the guest's /var/run/docker.sock.
// On Windows, returns the named pipe path (\\.\pipe\docker_engine).
var DockerSocket = sync.OnceValue(func() string {
	if runtime.GOOS == "windows" {
		return `\\.\pipe\docker_engine`
	}
	return filepath.Join(ShortDir(), "docker.sock")
})

// DockerEndpoint returns the full Docker endpoint URL for this instance.
// On Windows this is a named-pipe URL (npipe:////./pipe/docker_engine);
// on Unix it is a unix-socket URL (unix:///path/to/docker.sock).
var DockerEndpoint = sync.OnceValue(func() string {

On Unix the path embeds ShortDir() (which factors in RDD_INSTANCE), so two instances coexist. On Windows the pipe is the well-known Docker name and is also hardcoded in pkg/socketbridge/host_windows.go:23, so two RDD instances will collide if both run, and an installed Docker Desktop already owns the same pipe. The well-known name is a deliberate choice (Docker CLI defaults work without extra setup), but the docstring should not promise per-instance isolation it does not deliver.

)

const (
	// DockerPipeName is the well-known Windows named pipe that Docker CLI
	// and Docker Desktop use to reach dockerd.
	DockerPipeName = `\\.\pipe\docker_engine`

	// VsockForwardPort is the AF_VSOCK port that rdd-guest listens on inside
	// the Lima VM.  Must match the constant in cmd/rdd-guest/main.go.
	VsockForwardPort uint32 = 6660
)

Fix at minimum: update the docstring to reflect the Windows leg's one-instance constraint. A per-instance pipe (e.g., \\.\pipe\docker_engine_<suffix>) is out of scope here because it would break the Docker CLI's default endpoint.

-// DockerSocket returns the path to the Docker socket for this instance
-// (e.g., ~/.rd2/docker.sock). This is the host-side socket that Lima
-// port-forwards from the guest's /var/run/docker.sock.
-// On Windows, returns the named pipe path (\\.\pipe\docker_engine).
+// DockerSocket returns the path to the Docker socket. On Unix this is the
+// per-instance host-side socket that Lima port-forwards from the guest's
+// /var/run/docker.sock (e.g., ~/.rd2/docker.sock). On Windows it is the
+// well-known named pipe \\.\pipe\docker_engine — process-global, so at
+// most one RDD instance can serve Docker on Windows at a time.
S5. run -0 jq at line 790 leaves output unused on the non-Windows branch, and local appears inside an @test block Codex Gemini
    assert_file_exists "${meta_file}"

    run -0 jq -r '.Name' "${meta_file}"
    assert_output "${context_name}"

    run -0 jq -r '.Endpoints.docker.Host' "${meta_file}"
    if is_windows; then
        assert_output "npipe:////./pipe/docker_engine"
    else
        run -0 rdd service paths docker_socket
        local socket_path=${output}
        run -0 jq -r '.Endpoints.docker.Host' "${meta_file}"
        assert_output "unix://${socket_path}"
    fi
}

@test "moby engine sets currentContext when no healthy context exists" {
    local context_name="rancher-desktop-${RDD_INSTANCE}"

The run -0 at line 790 is consumed only by the Windows branch's assert_output; on the non-Windows branch its ${output} is discarded and the same jq is re-run at line 796. local socket_path=… at line 795 sits inside an @test body, which the repo's BATS style flags (each @test runs in its own subshell already).

Fix:

-    run -0 jq -r '.Endpoints.docker.Host' "${meta_file}"
     if is_windows; then
+        run -0 jq -r '.Endpoints.docker.Host' "${meta_file}"
         assert_output "npipe:////./pipe/docker_engine"
     else
         run -0 rdd service paths docker_socket
-        local socket_path=${output}
+        socket_path=${output}
         run -0 jq -r '.Endpoints.docker.Host' "${meta_file}"
         assert_output "unix://${socket_path}"
     fi

Design Observations

Concerns

Strengths


Testing Assessment

Test_computeSettledCondition is comprehensive for the new branches: every transition added in the third commit has a case. Two gaps remain. First, no unit test exercises the npipe scheme branch in probeNamedDockerContext; that path is reachable only via Windows BATS today. Seeding a seedContext(t, "npipe-ctx", "npipe:////./pipe/some-pipe") case in Test_probeNamedDockerContext would lock it down on non-Windows CI. Second, no test covers I1's wedge scenario — a unit test that injects a fake apiReader returning a non-empty list and verifies waitMirrorResourcesGone honours a context deadline would make the bound regression-safe. Third, neither cygpath BATS shim nor the npipe assert_output is exercised on non-Windows CI; if the Windows runner is intermittent, the Windows-only branches drift.


Documentation Assessment

The new instance.DockerEndpoint() is documented well. The pre-existing DockerSocket doc is now misleading on Windows (S4) and the probeNamedDockerContext doc and log message drift from the code (S1). The TODO(windows) blocks removed from engine_reconciler.go (manageDockerContext, removeDockerContext) and engine/controller.go (init) were correctly retired — their preconditions (npipe-aware probe client, npipe scheme in the context meta, engine controller registration on Windows) are all satisfied in this PR.


Commit Structure

Three commits map to natural boundaries: openSUSE bump (b5ab872), Windows wiring (e917657), drain race fix on slow Windows CI (0322994). The third is logically a follow-up to the second since it only became reachable after engine-controller registration on Windows. Bundling them in one PR is reasonable.


Acknowledged Limitations


Agent Performance Retro

Claude

Strongest pass on the wedge analysis (I1): traced the manager's missing ReconciliationTimeout through pkg/service/controllers/manager.go and enumerated three realistic ways drain can fail. Also produced the cleanest framing for the multi-instance pipe question — calibrating it as a doc fix (S4) rather than a critical, because the well-known pipe name is intentional. Useful suggestions on shared reason constants (S3) and the stale npipe/log wording (S1).

Codex

Found two pre-existing gaps that the other agents missed: the containers-controller standalone binary missing the engine import (I2) and the ContainerNamespace/moby startup-only application. Both required tracing outside the diff into the import graph and the watch list, which matches Codex's pattern of going deeper into wiring. Suggestion S5 caught the redundant run -0 jq plus the local BATS-style miss that the other agents partially flagged.

Gemini

Made the cache-bypass observation that gave I1 its second leg — flagged apiReader.List reading etcd rather than the watch cache, which the other two agents missed. Inflated the well-known pipe name to Critical (multi-instance and Docker Desktop collision) without weighing the deliberate well-known-name design choice; downgraded to S4 in consolidation. As usual on this repo, did not run git blame (daily quota), so could not distinguish PR-introduced from pre-existing concerns. Misread docker_watcher.go:91 as the primary path (it's the fallback after dockerEventsSince fails); the design observation was dropped from consolidation.

Summary

Claude Opus 4.7Codex GPT 5.5Gemini 3.1 Pro
Duration7m 19s9m 51s
Findings1I 4S2I 1S1I 1S
Tool calls38 (Read 15, Grep 14, Bash 9)89 (shell 88, stdin 1)
Design observations422
False positives000
Unique insights321
Files reviewed111111
Coverage misses000
Totals1I 4S2I 1S1I 1S
Downgraded01 (I→S)1 (I→S)
Dropped000

Reconciliation:

Codex contributed the highest pre-existing-gap signal (I1 wiring + the ContainerNamespace observation); Claude provided the most defensible severity calibration; Gemini contributed one decisive insight (the cache layer) and one false-positive design observation. The most valuable agent on this review depends on what you want to optimise for — coverage breadth (Codex), severity calibration (Claude), or independent-cache analysis (Gemini).


Review Process Notes

Repo context updates