Deep Review: 20260514-141313-pr-350

Date2026-05-14 14:13
Reporancher-sandbox/rancher-desktop-daemon
Round2 (of target)
Author@Nino-K
PR#350 — Wire up docker socket windows
Commits5681e67 Fixes test 104 stopping VM removes all mirror resources
e917657 Wire Docker socket forwarding for Windows
b5ab872 Bump openSUSE to v0.2.3
ReviewersClaude Opus 4.7 (effort: xhigh), Codex GPT 5.5 (effort: xhigh), Gemini 3.1 Pro (effort: default)
VerdictMerge with fixes — scope the Windows Docker pipe per-instance so RDD never attaches to Docker Desktop's daemon; gate the BATS cygpath path on is_msys so the suite does not break under WSL2. @mook-as's functional report remains unresolved (author investigating).
Wall-clock time39 min 35 s


Executive Summary

Round 2 of PR #350. The author addressed every Round 1 finding except the DockerSocket() docstring (prior-round S4), which is now subsumed by the larger pipe-ownership finding below: waitMirrorResourcesGone is bounded (30 s context.WithTimeout) and reads the informer cache (r.List), the standalone containers-controller binary blank-imports the engine package, the stale tcp/unix wording is fixed, the engine reason strings are now exported constants, and the BATS cygpath/local style nits are resolved. The remaining concerns are two Important issues that survive into Round 2 as genuinely new analysis: (1) the Windows endpoint is the global Docker pipe \\.\pipe\docker_engine, so RDD can silently attach to — and mirror/delete containers on — Docker Desktop's daemon if it is installed; and (2) the re-enabled engine-docker.bats gates an MSYS2-only cygpath call on is_windows, which is also true under WSL2, exactly the gap @mook-as raised inline. The core stop/cleanup logic (computeSettledCondition, cleanupMirrorResources, the apiReader/r.List split) is well-reasoned and well-tested at the unit level.


Critical Issues

None.


Important Issues

I1. Windows endpoint is the global Docker pipe — RDD can attach to Docker Desktop's daemon Codex Gemini

// DockerSocket returns the path to the Docker socket for this instance
// (e.g., ~/.rd2/docker.sock). This is the host-side socket that Lima
// port-forwards from the guest's /var/run/docker.sock.
// On Windows, returns the named pipe path (\\.\pipe\docker_engine).
var DockerSocket = sync.OnceValue(func() string {
	if runtime.GOOS == "windows" {
		return `\\.\pipe\docker_engine`
	}
	return filepath.Join(ShortDir(), "docker.sock")
})

// DockerEndpoint returns the full Docker endpoint URL for this instance.
// On Windows this is a named-pipe URL (npipe:////./pipe/docker_engine);
// on Unix it is a unix-socket URL (unix:///path/to/docker.sock).
var DockerEndpoint = sync.OnceValue(func() string {
	if runtime.GOOS == "windows" {
		return `npipe:////./pipe/docker_engine`
	}
	return "unix://" + DockerSocket()
})

On Unix the socket path embeds ShortDir(), which factors in RDD_INSTANCE, so instances are isolated. On Windows both functions return the global docker_engine pipe — and pkg/socketbridge/host_windows.go:20-23 documents that exact name as the one "Docker CLI and Docker Desktop use to reach dockerd."

	"github.com/Microsoft/go-winio"
	"github.com/go-logr/logr"
	"github.com/linuxkit/virtsock/pkg/hvsock"
)

const (
	// DockerPipeName is the well-known Windows named pipe that Docker CLI
	// and Docker Desktop use to reach dockerd.
	DockerPipeName = `\\.\pipe\docker_engine`

	// VsockForwardPort is the AF_VSOCK port that rdd-guest listens on inside
	// the Lima VM.  Must match the constant in cmd/rdd-guest/main.go.
	VsockForwardPort uint32 = 6660
)

The failure chain is concrete. socketbridge.HostBridge.Run calls winio.ListenPipe(b.pipeName, nil) and returns an error if the pipe is already owned (pkg/socketbridge/host_windows.go:52-56). That error is swallowed: pkg/controllers/lima/limavm/controllers/hostswitch_windows.go:199-205 runs the bridge in an errgroup goroutine that logs the failure and return nil. Meanwhile the engine controller's Docker client connects to the same endpoint (pkg/controllers/app/engine/controllers/docker_watcher.go:55 via instance.DockerEndpoint()). So with Docker Desktop installed and listening, RDD's bridge either loses the listen race or coexists as a second pipe instance, and RDD's engine client reaches Docker Desktop's daemon. fullSync then mirrors Docker Desktop's containers into RDD's API, and a K8s-side delete forwards ContainerRemove to that daemon — RDD operating destructively on a container engine it does not own. Concurrent RDD instances (RDD_INSTANCE=1 and =2) collide the same way, though the repo's single-user calibration makes the Docker Desktop collision the more pressing leg.

	}
}

// Run listens for named-pipe connections and proxies them to the guest until
// ctx is cancelled.
func (b *HostBridge) Run(ctx context.Context) error {
	ln, err := winio.ListenPipe(b.pipeName, nil)
	if err != nil {
		return fmt.Errorf("socket-bridge: listen on %s: %w", b.pipeName, err)
	}
	b.log.Info("Listening", "pipe", b.pipeName, "vsockPort", b.vsockPort)

	go func() {
		<-ctx.Done()
		ln.Close()

	// Start the host-side socket bridge now that we have the VM GUID.
	// It listens on the Docker named pipe and forwards each connection to
	// rdd-guest inside the VM via vsock port 6660.  rdd-guest is baked into
	// the VM image (via rancher-desktop-opensuse) and started by systemd.
	g.Go(func() error {
		bridge := socketbridge.NewDockerHostBridge(vmGUID, logger)
		if err := bridge.Run(ctx); err != nil {
			logger.Error(err, "Socket bridge exited with error")
		}
		return nil
	})

	// Accept vsock connections and feed them into the virtual network.
	g.Go(func() error {
		for {
			conn, err := ln.Accept()

// newDockerWatcher creates a Docker client, performs a full sync, and starts
// the event stream watcher goroutine.
func newDockerWatcher(ctx context.Context, k8s client.Client, apiNamespace string, reconcileChan chan<- event.GenericEvent) (*dockerWatcher, error) {
	cli, err := mobyclient.New(
		mobyclient.WithHost(instance.DockerEndpoint()),
	)
	if err != nil {
		return nil, fmt.Errorf("failed to create Docker client: %w", err)
	}

This is the same code Round 1 flagged as S4 (a docstring gap), but the Round 2 analysis is materially different: it is not the docstring, it is RDD silently mutating a foreign daemon. Round 1's S4 (still unaddressed — the docstring at instance.go:123 continues to say "for this instance") is subsumed here.

// length constraints.
var LimaHome = sync.OnceValue(func() string {
	return filepath.Join(ShortDir(), "lima")
})

// DockerSocket returns the path to the Docker socket for this instance
// (e.g., ~/.rd2/docker.sock). This is the host-side socket that Lima
// port-forwards from the guest's /var/run/docker.sock.
// On Windows, returns the named pipe path (\\.\pipe\docker_engine).
var DockerSocket = sync.OnceValue(func() string {
	if runtime.GOOS == "windows" {

Gemini additionally claimed the global pipe "breaks the VM boot sequence." That causal leg does not hold: the bridge starts after the VM GUID is known (hostswitch_windows.go:195-205), and @mook-as reported docker run / docker log working — so the pipe binds. His Running=False/Starting stall is a VM-side condition, not a pipe-bind failure.

	mux.Handle("/services/forwarder/expose", vn.Mux())
	mux.Handle("/services/forwarder/unexpose", vn.Mux())

	g, ctx := errgroup.WithContext(ctx)

	// Start the host-side socket bridge now that we have the VM GUID.
	// It listens on the Docker named pipe and forwards each connection to
	// rdd-guest inside the VM via vsock port 6660.  rdd-guest is baked into
	// the VM image (via rancher-desktop-opensuse) and started by systemd.
	g.Go(func() error {
		bridge := socketbridge.NewDockerHostBridge(vmGUID, logger)
		if err := bridge.Run(ctx); err != nil {
			logger.Error(err, "Socket bridge exited with error")
		}
		return nil
	})

	// Accept vsock connections and feed them into the virtual network.
	g.Go(func() error {
		for {
			conn, err := ln.Accept()

Fix: scope the pipe to the instance, mirroring the Unix isolation, and pass that name through socketbridge.NewDockerHostBridge instead of the hardcoded DockerPipeName constant. If a well-known alias is required for bare docker.exe compatibility, bind it only after proving RDD owns it, and surface the bind failure into readiness so the engine never treats an externally-owned pipe as RDD's endpoint.

 var DockerSocket = sync.OnceValue(func() string {
 	if runtime.GOOS == "windows" {
-		return `\\.\pipe\docker_engine`
+		return `\\.\pipe\` + Name() + `-docker_engine`
 	}
 	return filepath.Join(ShortDir(), "docker.sock")
 })

 var DockerEndpoint = sync.OnceValue(func() string {
 	if runtime.GOOS == "windows" {
-		return `npipe:////./pipe/docker_engine`
+		return `npipe:////./pipe/` + Name() + `-docker_engine`
 	}
 	return "unix://" + DockerSocket()
 })
I2. engine-docker.bats gates an MSYS2-only cygpath call on is_windows — breaks under WSL2 Claude Codex
    # Isolate all Docker config reads and writes from the developer's real
    # ~/.docker directory. Set DOCKER_CONFIG before starting the service so
    # the controller process inherits it (service.Start uses exec.Command
    # without an explicit Env, so it inherits the caller's environment).
    #
    # On Windows, rdd.exe is a native binary: it interprets /tmp/... as a
    # path from the drive root (C:\tmp\...) rather than MSYS2's /tmp which
    # maps to a different location. cygpath -m produces a mixed-format path
    # (C:/msys64/...) that both native Windows processes and MSYS2 agree on.
    if is_windows; then
        run -0 cygpath -m "${BATS_FILE_TMPDIR}/docker-config"
        DOCKER_CONFIG=${output}
    else
        DOCKER_CONFIG="${BATS_FILE_TMPDIR}/docker-config"
    fi
    export DOCKER_CONFIG
    mkdir -p "${DOCKER_CONFIG}"

    # Deliberately skip setup_rdd_control_plane here: `rdd set` internally
    # calls ensureServiceRunning, which is exactly the CLI path we want to
    # exercise — the engine controller is part of the default controller
    # set, so no explicit --controllers selection is needed.

is_windows() returns true for both MSYS2 and WSL2 — bats/helpers/os.bash:13-16 sets OS=windows when /proc/sys/kernel/osrelease contains WSL2, and os.bash:55 provides is_msys() precisely because the two "behave differently from WSL." cygpath is an MSYS2 tool absent under WSL2, so run -0 cygpath ... fails local_setup_file outright there. The rdd() wrapper in bats/helpers/commands.bash already shows the correct pattern — it branches is_msys (uses cygpath) versus WSL (uses wslpath plus a WSLENV adjustment). The same gap applies to DOCKER_HOST="npipe:////./pipe/docker_engine" at line 39 and the meta.json host assertion at lines 791-799: an npipe:// endpoint is meaningful only to a native docker.exe, not a Linux docker client. Before this PR the whole file was skip_on_windows, so removing that guard is what exposed the WSL2 path.

Darwin)
    # OS matches the directory name of the PATH_RESOURCES directory,
    # so uses "darwin" and not "macos".
    OS=darwin
    ;;
Linux)
    if grep --quiet WSL2 </proc/sys/kernel/osrelease; then # spellchecker:ignore
        OS=windows
    else
        OS=linux
    fi
    ;;
MSYS* | MINGW*)
    OS=windows
    fi
}

# Detect MSYS2 environment (MSYS or MINGW shells).
# Both report OS=windows, but behave differently from WSL.
is_msys() {
    [[ "${UNAME}" == MSYS* || "${UNAME}" == MINGW* ]]
}

is_unix() {
    ! is_windows "$@"
    # set, so no explicit --controllers selection is needed.
    rdd svc delete
    rdd set running=true
    if is_windows; then
        export DOCKER_HOST="npipe:////./pipe/docker_engine"
    else
        run -0 rdd svc paths docker_socket
        export DOCKER_HOST="unix://${output}"
    fi
    # Mirror resources live in App.spec.namespace. Override RDD_NAMESPACE
    # to whatever the App was created with so the test queries the same
    assert_file_exists "${meta_file}"

    run -0 jq -r '.Name' "${meta_file}"
    assert_output "${context_name}"

    if is_windows; then
        run -0 jq -r '.Endpoints.docker.Host' "${meta_file}"
        assert_output "npipe:////./pipe/docker_engine"
    else
        run -0 rdd service paths docker_socket
        socket_path=${output}
        run -0 jq -r '.Endpoints.docker.Host' "${meta_file}"
        assert_output "unix://${socket_path}"
    fi
}

@test "moby engine sets currentContext when no healthy context exists" {
    local context_name="rancher-desktop-${RDD_INSTANCE}"

This is exactly @mook-as's unresolved inline comment ("This wouldn't do the right thing if we're running BATS under WSL, right?"). It is latent rather than CI-breaking only if the project's Windows CI runs MSYS2 exclusively today.

Fix: gate the MSYS2-specific path on is_msys, and decide WSL2 behaviour explicitly — either handle wslpath + WSLENV for DOCKER_CONFIG and the DOCKER_HOST endpoint, or skip under WSL2 if that environment is genuinely unsupported for this suite.

-    if is_windows; then
+    if is_msys; then
         run -0 cygpath -m "${BATS_FILE_TMPDIR}/docker-config"
         DOCKER_CONFIG=${output}
     else
         DOCKER_CONFIG="${BATS_FILE_TMPDIR}/docker-config"
     fi

Suggestions

S1. cleanupMirrorResources comment says "the API server's watch cache" but the poll reads the informer cache Claude
		errs = append(errs, fmt.Errorf("failed to delete ContainerNamespaces: %w", err))
	}
	if err := errors.Join(errs...); err != nil {
		return err
	}
	// Block until the API server's watch cache reflects the deletions.
	// The Kubernetes watch cache is updated asynchronously: DELETE
	// requests commit to etcd synchronously, but the cache event is
	// processed in a separate goroutine. On slower systems (e.g.
	// Windows CI with SQLite I/O) there is an observable window where
	// `kubectl get` returns a just-deleted resource. Polling here
	// ensures that ContainerEngineReady is only stamped — and `rdd set
	// running=false` only returns — after the cache reflects the clean
	// state, so the immediately-following test assertion passes.
	return r.waitMirrorResourcesGone(ctx)
}

// waitMirrorResourcesGone polls the informer cache until all four mirror
// resource types are empty or the 30-second timeout expires. r.List is
// used (not r.apiReader) so the poll drains the watch cache, not etcd.

waitMirrorResourcesGone correctly documents itself as polling "the informer cache" (line 504-506), and countMirrorResources uses r.List — the controller-runtime informer cache, which is downstream of the API server's watch stream. An empty informer cache therefore implies the API server already processed the deletes, which is the stronger guarantee the code wants. The logic is right; only the cleanupMirrorResources comment is imprecise and could mislead a future reader into thinking this synchronizes against the apiserver directly.

	// running=false` only returns — after the cache reflects the clean
	// state, so the immediately-following test assertion passes.
	return r.waitMirrorResourcesGone(ctx)
}

// waitMirrorResourcesGone polls the informer cache until all four mirror
// resource types are empty or the 30-second timeout expires. r.List is
// used (not r.apiReader) so the poll drains the watch cache, not etcd.
func (r *EngineReconciler) waitMirrorResourcesGone(ctx context.Context) error {
	log := logf.FromContext(ctx)
	ctx, cancel := context.WithTimeout(ctx, 30*time.Second)
	defer cancel()
	for {

Fix: align line 492's wording — "Block until this controller's informer cache reflects the deletions" — and note that the informer cache lagging the apiserver is exactly what makes it a sufficient gate for downstream rdd ctl get callers.

S2. instance.DockerSocket() on Windows flows a named-pipe string into Lima's HOST_DOCKER_SOCKET param Claude

// DockerSocket returns the path to the Docker socket for this instance
// (e.g., ~/.rd2/docker.sock). This is the host-side socket that Lima
// port-forwards from the guest's /var/run/docker.sock.
// On Windows, returns the named pipe path (\\.\pipe\docker_engine).
var DockerSocket = sync.OnceValue(func() string {
	if runtime.GOOS == "windows" {
		return `\\.\pipe\docker_engine`
	}
	return filepath.Join(ShortDir(), "docker.sock")
})

// DockerEndpoint returns the full Docker endpoint URL for this instance.
// On Windows this is a named-pipe URL (npipe:////./pipe/docker_engine);
// on Unix it is a unix-socket URL (unix:///path/to/docker.sock).
var DockerEndpoint = sync.OnceValue(func() string {

DockerSocket() has two consumers: DockerEndpoint() (correct — it wants the pipe path) and app_controller.go:102, which formats it into the Lima template's HOST_DOCKER_SOCKET param. A Lima hostSocket is a host filesystem socket path, not a named pipe. This is currently inert because the Lima WSL2 driver advertises SkipSocketForwarding: true, so the value is never acted on — and the pre-existing Windows value was also unused, so this is not a behaviour regression. But the field is now semantically wrong, and the DockerSocket docstring's "host-side socket that Lima port-forwards" no longer holds on Windows.

	}
	return strings.Join([]string{
		baseTemplate,
		"param:",
		fmt.Sprintf("  CONTAINER_ENGINE: %s", spec.ContainerEngine.Name),
		fmt.Sprintf("  HOST_DOCKER_SOCKET: %q", instance.DockerSocket()),
		fmt.Sprintf("  HOST_HOME_GUEST: %q", toLinuxPath(hostHome)),
		fmt.Sprintf("  KUBERNETES_ENABLED: %v", spec.Kubernetes.Enabled),
		fmt.Sprintf("  KUBERNETES_VERSION: %s", spec.Kubernetes.Version),
		"",
	}, "\n"), nil

Fix: add a brief comment at app_controller.go:102 noting the param is inert on Windows (WSL2 skips socket forwarding; the pipe is bridged by hostswitch), or stop reusing DockerSocket() there so the two semantics do not share one accessor.

S3. No unit coverage for the new waitMirrorResourcesGone / countMirrorResources cleanup-synchronization logic Claude

waitMirrorResourcesGone (including the 30 s timeout fall-through that returns a distinct error) and countMirrorResources are exercised only indirectly by bats/tests/32-app-controllers/engine-docker.bats ("stopping VM removes all mirror resources"), which cannot reach the timeout branch. No EngineReconciler-level unit tests exist in this package, so the absence is consistent with repo convention — but the timeout path returns an error that propagates to rdd set as a user-visible failure, and a regression there (e.g. polling the wrong reader) would ship green. Consider a small table test for countMirrorResources against a fake client, or at minimum assert the timeout error message somewhere.

S4. BATS hardcodes the Windows endpoint string, so it will not adapt to a per-instance pipe Gemini
    # exercise — the engine controller is part of the default controller
    # set, so no explicit --controllers selection is needed.
    rdd svc delete
    rdd set running=true
    if is_windows; then
        export DOCKER_HOST="npipe:////./pipe/docker_engine"
    else
        run -0 rdd svc paths docker_socket
        export DOCKER_HOST="unix://${output}"
    fi
    # Mirror resources live in App.spec.namespace. Override RDD_NAMESPACE
    # to whatever the App was created with so the test queries the same
    # namespace the engine controller uses, regardless of CRD defaults.
    RDD_NAMESPACE=$(rdd ctl get app app -o jsonpath='{.spec.namespace}')

The test hardcodes npipe:////./pipe/docker_engine in both the local_setup_file DOCKER_HOST export and the docker_context_dir host assertion. If I1 is fixed by scoping the pipe per-instance, both sites must fetch the endpoint dynamically (e.g. from rdd svc paths docker_socket) rather than asserting the literal. Worth folding into the I1 fix so the test tracks the endpoint instead of pinning it.


Design Observations

Concerns

Strengths


Testing Assessment

Unit coverage for computeSettledCondition is thorough — the five new table cases map one-to-one onto the new branches. Gaps, ranked by risk:

  1. WSL2 BATS path (I2)local_setup_file fails before any test runs under WSL2; the suite has never been exercised there since it was skip_on_windows until this PR.
  2. waitMirrorResourcesGone 30 s timeout fall-through (S3) — the user-visible error path is unreachable from the BATS test and has no unit test.
  3. alreadyClean with the new ObservedGeneration term — no unit test confirms a stale-generation terminal condition forces cleanupMirrorResources to re-run; only integration timing covers it.
  4. npipe-scheme probe branchprobeNamedDockerContext now accepts npipe, but docker_context_test.go adds only unix:// cases; the npipe probe path is reachable only via Windows BATS. A seedContext(t, "npipe-ctx", "npipe:////./pipe/some-pipe") case would lock it down on non-Windows CI.
  5. Docker Desktop pipe collision (I1) — no test exercises RDD starting when \\.\pipe\docker_engine is already owned by another listener.

Both Claude and Codex ran go test against the touched packages and reported it passing; Codex also ran make build-rdd successfully.


Documentation Assessment


Commit Structure

Three commits, each self-contained: the openSUSE bump (b5ab872), the socket-forwarding wiring (e917657), and the test-104 cleanup fix (5681e67). The openSUSE bump is arguably unrelated to "wire up docker socket windows" and could have been its own PR, but it is a one-line-per-file digest bump and does not harm reviewability. Messages match their diffs; no fixup commits. No action needed.


Acknowledged Limitations


Unresolved Feedback


Agent Performance Retro

Claude

Strongest on diff-local detail and the test/wiring gaps: it owned three of the four suggestions — the cleanupMirrorResources comment imprecision, the DockerSocket()→Lima-HOST_DOCKER_SOCKET semantic drift (which needed tracing the second consumer of DockerSocket()), and the missing unit coverage — plus the sharpest design observation, that the engine controller now watches the app group from the containers aggregator in the standalone external binary. It also independently caught the WSL2 gate. Its notable miss is the headline: Claude reviewed instance.go and raised only the Lima-param suggestion, never connecting the global pipe to a Docker Desktop collision — the most important finding of the review, which both other agents reached.

Codex

Best analysis on the headline finding. It traced the full claim chain — from socketbridge's ListenPipe failure, through hostswitch's swallowed error, to the engine client connecting to a foreign daemon — which is what makes I1 land as a concrete data-loss risk rather than a docstring nit. It independently caught the WSL2 gate and re-raised the ContainerNamespace steady-state gap (consolidated as a design observation, consistent with Round 1). No false positives; it ran go test and make build-rdd. Zero suggestions — Codex stays on structural issues and leaves the smaller stuff to the other two.

Gemini

Converged with Codex on the pipe-ownership finding and uniquely added the concurrent-RDD-instance angle and the orphaned-hostagent pre-existing gap that explains @mook-as's instance "rd" already exists symptom. But it inflated the pipe finding to Critical with an unverified causal claim — "breaks the VM boot sequence" — that @mook-as's own report contradicts (docker run works, so the pipe binds). Its S2 was a false positive: it argued local socket_path should be restored, having misread the assignment as living in the docker_context_dir helper when it is inside an @test block. As usual, no git blame (quota), so it could not separate PR-introduced from pre-existing. Lightest coverage of the three — it left pkg/controllers/containers/controller.go out of its summary.

Summary

Claude Opus 4.7Codex GPT 5.5Gemini 3.1 Pro
Duration29m 29s10m 59s
Findings1I 3S2I1I 1S
Tool calls31 (Read 17, Bash 9, Grep 5)90 (shell 88, stdin 2)
Design observations531
False positives001
Unique insights511
Files reviewed131312
Coverage misses001
Totals1I 3S2I1I 1S
Downgraded01 (I→S)2 (I→S)
Dropped001

Reconciliation:

Codex provided the most value this round — it owned the headline finding's analysis and caught everything else of substance with no false positives. Claude contributed the most breadth on the smaller issues and the cleanest design observations, but missed the headline in a file it did review. Gemini added two useful unique angles but cost a clarification-worthy false positive and an inflated severity.


Review Process Notes

Repo context updates