Deep Review: 20260512-133636-pr-350

Date	2026-05-12 13:36
Repo	rancher-sandbox/rancher-desktop-daemon
Round	1 (of target)
Author	@Nino-K
PR	#350 — Wire up docker socket windows
Commits	`0322994` Fixes test 104 stopping VM removes all mirror resources `e917657` Wire Docker socket forwarding for Windows `b5ab872` Bump openSUSE to v0.2.3
Reviewers	Claude Opus 4.7 (effort: xhigh), Codex GPT 5.5 (effort: xhigh), Gemini 3.1 Pro (effort: default)
Verdict	Merge with fixes — bound `waitMirrorResourcesGone` to prevent reconciler wedge; register engine controller in standalone `containers-controller` binary.
Wall-clock time	`16 min 51 s`

Executive Summary ¶

The PR wires Windows Docker socket forwarding by introducing instance.DockerEndpoint() (npipe:////./pipe/docker_engine on Windows, unix://… on Unix), removes the Windows early-return guards in the engine controller and Docker-context management, accepts the npipe scheme in probeNamedDockerContext, and re-enables engine-docker.bats on Windows with cygpath/is_windows shims. The third commit fixes test 104 by tightening computeSettledCondition to require a terminal engine reason (Stopped/NotApplicable) at the current generation before declaring Settled=True, switching cleanupMirrorResources reads to apiReader, and polling until the API server reports the mirrors gone. The condition logic is correctly scoped and well-tested. The remaining concerns are: (1) the new waitMirrorResourcesGone polls an unbounded loop using a reader that bypasses the very cache it claims to wait for, and (2) the standalone containers-controller binary still does not blank-import the engine package — a pre-existing gap now made visible by Windows support.

Critical Issues ¶

None.

Important Issues ¶

I1. waitMirrorResourcesGone is unbounded and reads etcd, not the watch cache Claude Gemini¶

pkg/controllers/app/engine/controllers/engine_reconciler.go

}

// waitMirrorResourcesGone polls until all four mirror resource types
// are empty in the API server's watch cache. It returns an error if
// the context is cancelled before the list drains.
func (r *EngineReconciler) waitMirrorResourcesGone(ctx context.Context) error {
	log := logf.FromContext(ctx)
	for {
		remaining, err := r.countMirrorResources(ctx)
		if err != nil {
			return fmt.Errorf("error verifying mirror resources gone: %w", err)
		}
		if remaining == 0 {
			return nil
		}
		log.V(1).Info("Waiting for mirror resources to disappear from watch cache",
			"remaining", remaining)
		select {
		case <-ctx.Done():
			return fmt.Errorf("timed out waiting for %d mirror resources to be deleted: %w",
				remaining, ctx.Err())
		case <-time.After(200 * time.Millisecond):
		}
	}
}

// countMirrorResources returns the total number of mirror resources
// still visible in the API server for all four types.
func (r *EngineReconciler) countMirrorResources(ctx context.Context) (int, error) {
	lists := []client.ObjectList{
		&containersv1alpha1.ContainerList{},
		&containersv1alpha1.VolumeList{},
		&containersv1alpha1.ImageList{},
		&containersv1alpha1.ContainerNamespaceList{},
	}
	var total int
	for _, list := range lists {
		if err := r.apiReader.List(ctx, list, client.InNamespace(r.apiNamespace)); err != nil {
			return 0, err
		}
		items, err := meta.ExtractList(list)
		if err != nil {
			return 0, err
		}
		total += len(items)
	}
	return total, nil
}

// deleteAllOfType lists and removes every resource of the given kind
// in r.apiNamespace. Engine is authoritative for Container, Image, and
// Volume in the App's namespace: this loop deletes every object, not
// just engine-owned mirrors. Coexistence with another writer to these

Two problems compound. First, pkg/service/controllers/manager.go:143-153 sets no ReconciliationTimeout, so the ctx passed in is the manager's long-lived stop context. The loop exits only when every mirror disappears or the manager shuts down. A Container/Image/Volume created between stopWatcher() and the snapshot list in deleteAllOfType, a user-added finalizer that is not mirrorFinalizer (only that one is stripped at engine_reconciler.go:593), or a storage hiccup leaves an object visible forever. Because App is a cluster-scoped singleton with at most one in-flight Reconcile, a stuck loop blocks every subsequent state transition — rdd set running=true afterwards never runs.

pkg/service/controllers/manager.go

	if err != nil {
		return fmt.Errorf("failed to resolve passthrough port: %w", err)
	}

	// Create the shared controller-runtime manager
	managerOptions := ctrl.Options{
		Scheme: managerScheme,
		Metrics: server.Options{
			BindAddress: "127.0.0.1:" + strconv.Itoa(scm.metricsPort),
		},
		HealthProbeBindAddress: "127.0.0.1:" + strconv.Itoa(scm.healthPort),
		LeaderElection:         false, // RDD controllers are single-instance
		// Limit graceful shutdown time to avoid blocking external controller exit.
		// Default is 30s which is too long when control plane disappears.
		GracefulShutdownTimeout: ptr.To(10 * time.Second),
	}

	// Only configure webhook server if controllers require it
	if scm.hasWebhookControllers() {
		// Use instance TLS directory for webhook certificates (persistent storage)
		webhookCertDir := instance.TLSDir()

pkg/controllers/app/engine/controllers/engine_reconciler.go

				if apierrors.IsNotFound(err) {
					return nil
				}
				return err
			}
			if !controllerutil.RemoveFinalizer(latest, mirrorFinalizer) {
				return nil
			}
			return r.Update(ctx, latest)
		})
		if retryErr != nil {

Second, the loop's stated purpose (the function comment) is to "Block until the API server's watch cache reflects the deletions". apiReader is the manager's GetAPIReader() — it bypasses every cache and issues a quorum etcd read (no resourceVersion=0). DELETE commits to etcd synchronously, so apiReader.List returns 0 immediately. The user-facing rdd ctl get reads with resourceVersion=0 from the API server's watch cache, which lags etcd by the goroutine that processes the watch event. The poll exits before the cache it claims to drain is drained; the test passes today because the other fixes in this commit (the ObservedGeneration guard on alreadyClean, plus reads via apiReader in deleteAllOfType) close the actual race.

Fix: bound the wait with context.WithTimeout, and switch countMirrorResources to the controller-runtime cache (r.List) so the loop drains a cache downstream of the API server's watch cache.

 func (r *EngineReconciler) waitMirrorResourcesGone(ctx context.Context) error {
     log := logf.FromContext(ctx)
+    ctx, cancel := context.WithTimeout(ctx, 30*time.Second)
+    defer cancel()
     for {

 func (r *EngineReconciler) countMirrorResources(ctx context.Context) (int, error) {
     ...
     for _, list := range lists {
-        if err := r.apiReader.List(ctx, list, client.InNamespace(r.apiNamespace)); err != nil {
+        if err := r.List(ctx, list, client.InNamespace(r.apiNamespace)); err != nil {
             return 0, err
         }

Note: r.List reads the controller's informer cache, which is populated only for types the controller Watches. Container, Image, and Volume are watched; ContainerNamespace is not (see I2 below), so this fix should land alongside watching ContainerNamespace or accept the small risk that one ContainerNamespace slips by on the rare delete-during-stop window.

I2. Standalone containers-controller binary does not register the engine controller Codex¶

cmd/containers-controller/main.go

package main

import (
	"os"

	// Import app controller packages to trigger init() functions.
	_ "github.com/rancher-sandbox/rancher-desktop-daemon/pkg/controllers/containers"
	"github.com/rancher-sandbox/rancher-desktop-daemon/pkg/external"
)

func main() {
	os.Exit(external.RunControllers("containers"))
}

// pkg/controllers/containers/controller.go
import (
    // Import controllers to register them.
    _ "github.com/rancher-sandbox/rancher-desktop-daemon/pkg/controllers/containers/container"
    _ "github.com/rancher-sandbox/rancher-desktop-daemon/pkg/controllers/containers/containernamespace"
    _ "github.com/rancher-sandbox/rancher-desktop-daemon/pkg/controllers/containers/image"
    _ "github.com/rancher-sandbox/rancher-desktop-daemon/pkg/controllers/containers/volume"
)

The engine controller registers itself in pkg/controllers/app/engine/controller.go:25-26 via init() and declares the containers API group (pkg/controllers/app/engine/controller.go:59-60). The embedded service imports it at pkg/service/cmd/service.go:59, but the standalone containers-controller aggregator imports only the four container-CRD packages. --controllers help text at pkg/service/controllers/options.go:18 advertises containers,-engine, implying engine is part of the containers group, yet bin/containers-controller silently omits it. The repo context flags exactly this shape: "a missing import in either is a silent drop with no startup error."

pkg/controllers/app/engine/controller.go

	containersv1alpha1 "github.com/rancher-sandbox/rancher-desktop-daemon/pkg/apis/containers/v1alpha1"
	"github.com/rancher-sandbox/rancher-desktop-daemon/pkg/controllers/app/engine/controllers"
	"github.com/rancher-sandbox/rancher-desktop-daemon/pkg/controllers/base"
)

func init() {
	base.RegisterController(newController())
}

// apiGroup is "containers" because the reconciler watches
// Container, Image, and Volume from that group and needs
// their CRDs at startup. Grouping engine with its dependencies

pkg/controllers/app/engine/controller.go


func (c *controller) GetName() string {
	return appv1alpha1.EngineControllerName
}

func (c *controller) GetAPIGroup() string {
	return apiGroup
}

func (c *controller) GetCRDData() string {
	return ""
}

pkg/service/cmd/service.go


	"github.com/rancher-sandbox/rancher-desktop-daemon/pkg/cli/help"
	// Import controller packages to trigger init() functions for embedded mode.
	_ "github.com/rancher-sandbox/rancher-desktop-daemon/pkg/controllers/app/app"
	_ "github.com/rancher-sandbox/rancher-desktop-daemon/pkg/controllers/app/demo"
	_ "github.com/rancher-sandbox/rancher-desktop-daemon/pkg/controllers/app/engine"
	"github.com/rancher-sandbox/rancher-desktop-daemon/pkg/controllers/base"
	// Import built-in system controllers.
	_ "github.com/rancher-sandbox/rancher-desktop-daemon/pkg/controllers/builtin/namespace"
	// Import controller packages to trigger init() functions for embedded mode.
	_ "github.com/rancher-sandbox/rancher-desktop-daemon/pkg/controllers/containers"

pkg/service/controllers/options.go

// Shared between the embedded service and any standalone binary that
// surfaces the same flag so the two cannot drift.
const ControllersFlagUsage = "Controllers to enable. Use '*' for all, " +
	"or a comma-separated list of controller or API-group names. " +
	"Groups: rdd, app, containers, lima. " +
	"Prefix a name with '-' to exclude it, e.g. '*,-demo' or 'containers,-engine'."

// Options holds the controller configuration options.
type Options struct {
	Controllers string // Controller selection specification (--controllers flag)
}

The gap pre-dates this PR (engine has always lived under pkg/controllers/app/engine while declaring group containers), but Windows support makes engine reachable on every platform now and amplifies the consequence: a user who runs bin/containers-controller standalone on Windows gets Container/Image/Volume CRDs but no mirroring.

Fix: blank-import the engine package from the containers aggregator so every consumer of controllers/containers gets the full group.

 import (
     // Import controllers to register them.
+    _ "github.com/rancher-sandbox/rancher-desktop-daemon/pkg/controllers/app/engine"
     _ "github.com/rancher-sandbox/rancher-desktop-daemon/pkg/controllers/containers/container"
     _ "github.com/rancher-sandbox/rancher-desktop-daemon/pkg/controllers/containers/containernamespace"
     _ "github.com/rancher-sandbox/rancher-desktop-daemon/pkg/controllers/containers/image"
     _ "github.com/rancher-sandbox/rancher-desktop-daemon/pkg/controllers/containers/volume"
 )

Suggestions ¶

S1. Stale "tcp/unix" wording in probeNamedDockerContext after adding npipe Claude Codex¶

pkg/controllers/app/engine/controllers/docker_context.go

}

// probeNamedDockerContext pings the Docker endpoint for the named context
// within dockerContextProbeTimeout and returns true if the endpoint is healthy.
//
// When we cannot determine whether the context is healthy — because the store
// is unreadable, the endpoint metadata is missing or malformed, the scheme is
// not tcp/unix (e.g. SSH), or the TLS config cannot be loaded — we return true
// (assume healthy). This conservatism prevents RDD from clobbering a context
// that is in the middle of being edited or migrated.
func probeNamedDockerContext(ctx context.Context, name string) bool {
	log := logf.FromContext(ctx).WithName("docker-context").WithValues("context", name)

	s, err := newContextStore()
	if err != nil {
		log.Error(err, "Cannot open context store; assuming context healthy")
		return true
	}
	md, err := s.GetMetadata(name)
	if err != nil {
		if errdefs.IsNotFound(err) {
			return false
		}
		log.Error(err, "Cannot read context metadata; assuming context healthy")
		return true
	}
	epMeta, err := docker.EndpointFromContext(md)
	if err != nil {
		log.Error(err, "Cannot decode docker endpoint; assuming context healthy")
		return true
	}
	scheme, _, _ := strings.Cut(epMeta.Host, "://")
	switch scheme {
	case "unix", "tcp", "npipe":
	default:
		log.Info("Non-tcp/unix endpoint scheme; assuming context healthy", "scheme", scheme)
		return true
	}
	ep, err := docker.WithTLSData(s, name, epMeta)
	if err != nil {
		log.Error(err, "Cannot load TLS data; assuming context healthy")
		return true

The PR adds "npipe" to the supported schemes at line 183 but leaves both the function docstring at line 157 and the structured-log message at line 185 saying "tcp/unix". Future readers will see the disagreement; the log message also misrepresents the actual decision when an ssh:// context appears.

Fix:

-// not tcp/unix (e.g. SSH), or the TLS config cannot be loaded — we return true
+// not tcp/unix/npipe (e.g. SSH), or the TLS config cannot be loaded — we return true
 ...
-        log.Info("Non-tcp/unix endpoint scheme; assuming context healthy", "scheme", scheme)
+        log.Info("Unprobed endpoint scheme; assuming context healthy", "scheme", scheme)

S2. BATS test uses $(cygpath -m …) command substitution where repo style prefers run -0 Claude¶

bats/tests/32-app-controllers/engine-docker.bats

    #
    # On Windows, rdd.exe is a native binary: it interprets /tmp/... as a
    # path from the drive root (C:\tmp\...) rather than MSYS2's /tmp which
    # maps to a different location. cygpath -m produces a mixed-format path
    # (C:/msys64/...) that both native Windows processes and MSYS2 agree on.
    if is_windows; then
        DOCKER_CONFIG="$(cygpath -m "${BATS_FILE_TMPDIR}/docker-config")"
    else
        DOCKER_CONFIG="${BATS_FILE_TMPDIR}/docker-config"
    fi
    export DOCKER_CONFIG
    mkdir -p "${DOCKER_CONFIG}"

    # Deliberately skip setup_rdd_control_plane here: `rdd set` internally
    # calls ensureServiceRunning, which is exactly the CLI path we want to

$(cygpath -m …) swallows a non-zero exit and silently produces an empty string; mkdir -p "" then fails with a less actionable error than cygpath itself would. Repo BATS style prefers run -0 cmd …; value=${output} so command, status, and output all surface on failure.

Fix:

 if is_windows; then
-    DOCKER_CONFIG="$(cygpath -m "${BATS_FILE_TMPDIR}/docker-config")"
+    run -0 cygpath -m "${BATS_FILE_TMPDIR}/docker-config"
+    DOCKER_CONFIG=${output}
 else
     DOCKER_CONFIG="${BATS_FILE_TMPDIR}/docker-config"
 fi

S3. Engine reason strings duplicated across the app and engine packages without shared constants Claude¶

pkg/controllers/app/app/controllers/app_controller.go

		//     "Connected/M+1" condition — written while the VM was still
		//     running on an earlier reconcile that saw the spec change but
		//     not yet the stopped VM — would incorrectly satisfy the wait,
		//     causing `rdd set running=false` to return before mirror
		//     resources (Containers, Images, Volumes) are deleted.
		engineSettled := engineCond != nil &&
			engineCond.ObservedGeneration >= app.Generation &&
			(engineCond.Reason == "Stopped" || engineCond.Reason == "NotApplicable")
		if !engineSettled {
			settled.Status = metav1.ConditionFalse
			settled.Reason = v1alpha1.AppSettledReasonEngineStale
			settled.Message = settledMessageEngineStale
		} else {

// engine_reconciler.go
terminalReason := "Stopped"
...
if running && !engineIsDocker {
    terminalReason = "NotApplicable"
...
if err := r.setEngineCondition(ctx, &app, metav1.ConditionTrue, "Connected", "Container engine synced"); err != nil {

computeSettledCondition now hard-depends on the engine reconciler's choice of reason strings. A typo or rename on one side silently breaks the wait. pkg/apis/app/v1alpha1/app_types.go already exports AppConditionContainerEngineReady as a constant; the reason values deserve the same treatment.

Fix: declare EngineReasonStopped, EngineReasonNotApplicable, EngineReasonConnected, EngineReasonConnectFailed in pkg/apis/app/v1alpha1/app_types.go and reference them from both packages.

S4. instance.DockerSocket() docstring claims "for this instance" but the Windows return is process-global Claude Gemini¶

pkg/instance/instance.go

// length constraints.
var LimaHome = sync.OnceValue(func() string {
	return filepath.Join(ShortDir(), "lima")
})

// DockerSocket returns the path to the Docker socket for this instance
// (e.g., ~/.rd2/docker.sock). This is the host-side socket that Lima
// port-forwards from the guest's /var/run/docker.sock.
// On Windows, returns the named pipe path (\\.\pipe\docker_engine).
var DockerSocket = sync.OnceValue(func() string {
	if runtime.GOOS == "windows" {
		return `\\.\pipe\docker_engine`
	}
	return filepath.Join(ShortDir(), "docker.sock")
})

// DockerEndpoint returns the full Docker endpoint URL for this instance.
// On Windows this is a named-pipe URL (npipe:////./pipe/docker_engine);
// on Unix it is a unix-socket URL (unix:///path/to/docker.sock).
var DockerEndpoint = sync.OnceValue(func() string {

On Unix the path embeds ShortDir() (which factors in RDD_INSTANCE), so two instances coexist. On Windows the pipe is the well-known Docker name and is also hardcoded in pkg/socketbridge/host_windows.go:23, so two RDD instances will collide if both run, and an installed Docker Desktop already owns the same pipe. The well-known name is a deliberate choice (Docker CLI defaults work without extra setup), but the docstring should not promise per-instance isolation it does not deliver.

pkg/socketbridge/host_windows.go

)

const (
	// DockerPipeName is the well-known Windows named pipe that Docker CLI
	// and Docker Desktop use to reach dockerd.
	DockerPipeName = `\\.\pipe\docker_engine`

	// VsockForwardPort is the AF_VSOCK port that rdd-guest listens on inside
	// the Lima VM.  Must match the constant in cmd/rdd-guest/main.go.
	VsockForwardPort uint32 = 6660
)

Fix at minimum: update the docstring to reflect the Windows leg's one-instance constraint. A per-instance pipe (e.g., \\.\pipe\docker_engine_<suffix>) is out of scope here because it would break the Docker CLI's default endpoint.

-// DockerSocket returns the path to the Docker socket for this instance
-// (e.g., ~/.rd2/docker.sock). This is the host-side socket that Lima
-// port-forwards from the guest's /var/run/docker.sock.
-// On Windows, returns the named pipe path (\\.\pipe\docker_engine).
+// DockerSocket returns the path to the Docker socket. On Unix this is the
+// per-instance host-side socket that Lima port-forwards from the guest's
+// /var/run/docker.sock (e.g., ~/.rd2/docker.sock). On Windows it is the
+// well-known named pipe \\.\pipe\docker_engine — process-global, so at
+// most one RDD instance can serve Docker on Windows at a time.

S5. run -0 jq at line 790 leaves output unused on the non-Windows branch, and local appears inside an @test block Codex Gemini¶

bats/tests/32-app-controllers/engine-docker.bats

    assert_file_exists "${meta_file}"

    run -0 jq -r '.Name' "${meta_file}"
    assert_output "${context_name}"

    run -0 jq -r '.Endpoints.docker.Host' "${meta_file}"
    if is_windows; then
        assert_output "npipe:////./pipe/docker_engine"
    else
        run -0 rdd service paths docker_socket
        local socket_path=${output}
        run -0 jq -r '.Endpoints.docker.Host' "${meta_file}"
        assert_output "unix://${socket_path}"
    fi
}

@test "moby engine sets currentContext when no healthy context exists" {
    local context_name="rancher-desktop-${RDD_INSTANCE}"

The run -0 at line 790 is consumed only by the Windows branch's assert_output; on the non-Windows branch its ${output} is discarded and the same jq is re-run at line 796. local socket_path=… at line 795 sits inside an @test body, which the repo's BATS style flags (each @test runs in its own subshell already).

Fix:

-    run -0 jq -r '.Endpoints.docker.Host' "${meta_file}"
     if is_windows; then
+        run -0 jq -r '.Endpoints.docker.Host' "${meta_file}"
         assert_output "npipe:////./pipe/docker_engine"
     else
         run -0 rdd service paths docker_socket
-        local socket_path=${output}
+        socket_path=${output}
         run -0 jq -r '.Endpoints.docker.Host' "${meta_file}"
         assert_output "unix://${socket_path}"
     fi

Design Observations ¶

Concerns

ContainerNamespace/moby is created only during fullSync and never watched or re-applied (in-scope) Codex — pkg/controllers/app/engine/controllers/docker_watcher.go:615, pkg/controllers/app/engine/controllers/engine_reconciler.go:829-831. If a user deletes ContainerNamespace/moby while the engine watcher is alive, no watch event reaches the reconciler (the steady-state watch list omits the type) and no Docker event recreates it, so the mirror namespace stays missing until the next watcher restart. The repo context names this exact anti-pattern. The "deleting ContainerNamespace/moby completes without a finalizer hang" test today verifies the namespace stays gone after a user delete, so the behaviour is at least partly intentional — but the reconciler-side gap deserves an explicit comment ("startup-only, not re-applied on steady-state") or a Watches(&containersv1alpha1.ContainerNamespace{}, enqueueApp) plus an ensureContainerNamespace step.

Strengths

computeSettledCondition precisely closes the Connected-at-current-generation hole (in-scope) Claude Codex — pkg/controllers/app/app/controllers/app_controller.go:434-463. The engine reconciler can stamp Connected at M+1 between the spec patch and the LimaVM running flip; before this PR the !desiredRunning branch returned Settled=True unconditionally and let rdd set running=false exit before mirror cleanup. Requiring a terminal reason at the current generation closes that window without overreach, and the new unit-test cases (engine still Connected at current generation waits, engine ready at older generation is stale, NotApplicable settles) cover each branch.
Switching cleanup reads to apiReader (in-scope) Claude — pkg/controllers/app/engine/controllers/engine_reconciler.go:539, 565, 587. The controller-runtime informer cache lags writes by the watcher goroutine; using the direct API reader in deleteAllOfType and countMirrorResources makes the cleanup see resources the watcher applied seconds before stopWatcher() killed it.
Centralizing the Docker client address in instance.DockerEndpoint() (in-scope) Codex — pkg/instance/instance.go:134-142. Each call site (engine_reconciler.go:315, docker_watcher.go:55) used to stitch "unix://" + DockerSocket(); the helper hides the Windows npipe switch behind a single URL.

Testing Assessment ¶

Test_computeSettledCondition is comprehensive for the new branches: every transition added in the third commit has a case. Two gaps remain. First, no unit test exercises the npipe scheme branch in probeNamedDockerContext; that path is reachable only via Windows BATS today. Seeding a seedContext(t, "npipe-ctx", "npipe:////./pipe/some-pipe") case in Test_probeNamedDockerContext would lock it down on non-Windows CI. Second, no test covers I1's wedge scenario — a unit test that injects a fake apiReader returning a non-empty list and verifies waitMirrorResourcesGone honours a context deadline would make the bound regression-safe. Third, neither cygpath BATS shim nor the npipe assert_output is exercised on non-Windows CI; if the Windows runner is intermittent, the Windows-only branches drift.

Documentation Assessment ¶

The new instance.DockerEndpoint() is documented well. The pre-existing DockerSocket doc is now misleading on Windows (S4) and the probeNamedDockerContext doc and log message drift from the code (S1). The TODO(windows) blocks removed from engine_reconciler.go (manageDockerContext, removeDockerContext) and engine/controller.go (init) were correctly retired — their preconditions (npipe-aware probe client, npipe scheme in the context meta, engine controller registration on Windows) are all satisfied in this PR.

Commit Structure ¶

Three commits map to natural boundaries: openSUSE bump (b5ab872), Windows wiring (e917657), drain race fix on slow Windows CI (0322994). The third is logically a follow-up to the second since it only became reachable after engine-controller registration on Windows. Bundling them in one PR is reasonable.

Acknowledged Limitations ¶

engine_reconciler.go:206-213 already documents the watcher-died vs. never-started branch and the deliberate dual-source approach (stopWatcher vs. direct removeDockerContext). Not an issue.
engine_reconciler.go:259-266 documents that the steady-state setEngineCondition("Connected", …) is the only place that re-stamps the current generation when the spec is patched without cycling the watcher. The new generation guard at line 238 leans on exactly that lifecycle.
pkg/controllers/lima/limavm/controllers/hostswitch_windows.go:151-155 documents fire-and-forget host-switch lifecycle: failures are logged but not surfaced to the reconciler. Pre-existing and not made worse by this PR.

Agent Performance Retro ¶

Claude ¶

Strongest pass on the wedge analysis (I1): traced the manager's missing ReconciliationTimeout through pkg/service/controllers/manager.go and enumerated three realistic ways drain can fail. Also produced the cleanest framing for the multi-instance pipe question — calibrating it as a doc fix (S4) rather than a critical, because the well-known pipe name is intentional. Useful suggestions on shared reason constants (S3) and the stale npipe/log wording (S1).

Codex ¶

Found two pre-existing gaps that the other agents missed: the containers-controller standalone binary missing the engine import (I2) and the ContainerNamespace/moby startup-only application. Both required tracing outside the diff into the import graph and the watch list, which matches Codex's pattern of going deeper into wiring. Suggestion S5 caught the redundant run -0 jq plus the local BATS-style miss that the other agents partially flagged.

Gemini ¶

Made the cache-bypass observation that gave I1 its second leg — flagged apiReader.List reading etcd rather than the watch cache, which the other two agents missed. Inflated the well-known pipe name to Critical (multi-instance and Docker Desktop collision) without weighing the deliberate well-known-name design choice; downgraded to S4 in consolidation. As usual on this repo, did not run git blame (daily quota), so could not distinguish PR-introduced from pre-existing concerns. Misread docker_watcher.go:91 as the primary path (it's the fallback after dockerEventsSince fails); the design observation was dropped from consolidation.

Summary ¶

	Claude Opus 4.7	Codex GPT 5.5	Gemini 3.1 Pro
Duration	7m 19s	9m 51s	—
Findings	1I 4S	2I 1S	1I 1S
Tool calls	38 (Read 15, Grep 14, Bash 9)	89 (shell 88, stdin 1)	—
Design observations	4	2	2
False positives	0	0	0
Unique insights	3	2	1
Files reviewed	11	11	11
Coverage misses	0	0	0
Totals	1I 4S	2I 1S	1I 1S
Downgraded	0	1 (I→S)	1 (I→S)
Dropped	0	0	0

Reconciliation:

Gemini C1 "Global named pipe on Windows breaks multi-instance isolation": critical → suggestion S4. The well-known pipe is a deliberate design choice for Docker CLI compatibility; this is a documentation gap (the docstring still says "for this instance"), not a critical bug.
Codex I2 "ContainerNamespace not watched or re-applied": important → design observation. The "deleting ContainerNamespace/moby completes without a finalizer hang" BATS test deliberately verifies the namespace stays gone after a user delete, so the behavior is partly intentional; surfaced as a design observation with a recommendation to add an explanatory comment or a Watches entry.

Codex contributed the highest pre-existing-gap signal (I1 wiring + the ContainerNamespace observation); Claude provided the most defensible severity calibration; Gemini contributed one decisive insight (the cache layer) and one false-positive design observation. The most valuable agent on this review depends on what you want to optimise for — coverage breadth (Codex), severity calibration (Claude), or independent-cache analysis (Gemini).

Review Process Notes ¶

Repo context updates ¶

The Suggestions section already contains a rule about reason-string constants for App.Status.Conditions writers (S3-style); consider adding a standing rule to deep-review-context.md: When multiple controllers write to the same status condition and a downstream consumer compares the Reason field by string literal, the reason values must live as exported constants in the API-types package. Flag string-literal reason comparisons across package boundaries.
Add a standing rule for waitForGone/polling-loop patterns: A polling loop that waits for a resource to disappear "from the cache" must (a) carry an explicit context.WithTimeout bound and (b) read from a cache downstream of the one it claims to drain. Reading via mgr.GetAPIReader() queries etcd directly and bypasses the API server's watch cache; pair that reader with the informer-cache r.List for the wait, or document why the etcd read is sufficient.