Deep Review: 20260415-015141-pr-309

Date	2026-04-15 01:51
Repo	rancher-sandbox/rancher-desktop-daemon
Round	4 (of PR #309)
Author	@jandubois
PR	#309 — engine: mirror Docker container engine state into Kubernetes
Branch	`engine-v3`
Commits	`a5235c8` engine: mirror Docker container engine state into Kubernetes
Reviewers	Claude Opus 4.6, Codex GPT 5.4, Gemini 3.1 Pro
Verdict	Merge with fixes — three important regressions introduced by prior-round fixes: a discovery ordering race that re-opens the `rdd set running=true` gap on cold start, engine cleanup that deletes unowned mirror resources, and the standalone `app-controller` binary importing `engine` under the wrong manager group.
Wall-clock time	`16 min 6 s`

Executive Summary ¶

Round 4 of PR #309 reviews the fixes applied after round 3. Four of the five prior findings were addressed cleanly (nil-pointer guards on inspect.Config, engine controller moved to the containers API group, webhook dead code removed, watcher context decoupled via context.WithoutCancel). One was skipped by design (namespace self-healing) and one remains a standing style disagreement (BATS $(...) substitution at file scope).

This round surfaces three new Important regressions, each a direct consequence of a round-3 fix interacting with the rest of the control plane. (1) rdd.rancherdesktop.io/ready=true now fires before registerDiscovery populates the enabled-controller list, so on cold start isEngineControllerEnabled sees an empty list and rdd set running=true silently falls through to the weaker Running=True wait — exactly reintroducing the race the PR aims to close. (2) cleanupMirrorResources and the full-sync prune paths operate on every Container/Image/Volume in rancher-desktop, not only on engine-owned mirrors, which forced the containers-mock.bats suite to spell containers,-engine to keep the two controllers from fighting over shared kinds. (3) cmd/app-controller/main.go now imports engine, but pkg/external/main.go does not filter by GetAPIGroup(), so a standalone app-controller run can register a containers-group reconciler under the app shared manager without the containers CRDs it needs.

Four lower-severity items cover a latent watcher-goroutine leak on manager shutdown (the context.WithoutCancel fix from round 3 has no shutdown hook), missing teardown cleanup in engine.bats, a stylistic errs := []error{applyErr} seed, and stale --controllers help text that now omits the containers group. Gemini filed a Critical "invalid @test syntax" finding against engine.bats that is a hallucination — the file uses @test correctly — and an Important "API-group mismatch" claim whose cause-and-effect trace is backwards from what pkg/external/main.go actually does.

Structure: 0 critical, 3 important, 4 suggestions.

Critical Issues ¶

None.

Important Issues ¶

I1. rdd set running=true cold-start race: ReadyAnnotation fires before the enabled-controller list is populated Codex Claude¶

pkg/service/controllers/manager.go

	log.Info("Installing CRDs for all controllers in parallel", "controllers", len(scm.registrations))
	if err := scm.installControllerCRDs(ctx); err != nil {
		return fmt.Errorf("failed to install controller CRDs: %w", err)
	}

	// Configure controller-runtime to use klog
	ctrllog.SetLogger(log.WithName(fmt.Sprintf("controller-runtime-%s", scm.name)))

	// Create and register scheme with required types
	managerScheme := runtime.NewScheme()
	if err := scheme.AddToScheme(managerScheme); err != nil {
		return fmt.Errorf("failed to add core scheme: %w", err)
	}
	// Add CRD types to support dynamic resource discovery
	if err := apiextensionsv1.AddToScheme(managerScheme); err != nil {
		return fmt.Errorf("failed to add apiextensions scheme: %w", err)
	}

SharedControllerManager.Start installs CRDs, then sets ReadyAnnotation, and only later — at manager.go:202 — calls registerDiscovery, which is what writes the EnabledControllers list into configMap.Data.

pkg/service/controllers/manager.go


	// Mark the control plane ready after registerDiscovery so clients
	// waiting on the ready annotation see both CRDs installed and
	// controller registration written, not just CRDs.
	if err := MarkControlPlaneReady(ctx, scm.discovery.client); err != nil {
		return fmt.Errorf("failed to mark control plane as ready: %w", err)
	}

	// Ensure cleanup on shutdown with a timeout to avoid blocking if apiserver is gone
	defer func() {
		cleanupCtx, cleanupCancel := context.WithTimeout(context.Background(), 5*time.Second)

// set.go:490-505
engineEnabled, err := isEngineControllerEnabled(ctx, config)
if err != nil {
    logrus.WithError(err).Debug("Falling back to ContainerEngineReady wait after discovery lookup failed")
    engineEnabled = true
}
if engineEnabled {
    logrus.Info("Waiting for container engine to be ready")
    return watchCondition(ctx, config, func(obj *unstructured.Unstructured) bool {
        status, gen, present := conditionInfo(obj, "ContainerEngineReady")
        return present && gen >= minGen && status == metav1.ConditionTrue
    })
}
logrus.Info("Waiting for the VM to start (engine controller not enabled)")
return watchCondition(ctx, config, func(obj *unstructured.Unstructured) bool {
    status, gen, present := conditionInfo(obj, "Running")
    return present && gen >= minGen && status == metav1.ConditionTrue
})

rdd set running=true calls waitForDesiredState → isEngineControllerEnabled, which reads configMap.Data via GetEnabledControllers (discovery.go:270-289) and returns (false, nil) for a ConfigMap with no data entries. In the window between MarkControlPlaneReady and registerDiscovery, the predicate is: ready annotation set, but EnabledControllers empty. rdd svc start --wait and cmd/rdd/service.go already wait on the ready annotation, so rdd set proceeds into the wait, sees engineEnabled=false, and falls through to the weaker Running=True wait.

pkg/service/controllers/discovery.go

	return &info, nil
}

// GetEnabledControllers returns the list of all enabled controllers, across all
// controller managers.  Note that the returned controllers may not be running.
func (d *ControllerManagerDiscovery) GetEnabledControllers(ctx context.Context) ([]string, error) {
	configMap, err := d.client.CoreV1().ConfigMaps(d.namespace).Get(ctx, ControllerManagerConfigMapName, metav1.GetOptions{})
	if errors.IsNotFound(err) {
		return nil, nil // No controller manager running
	}
	if err != nil {
		return nil, fmt.Errorf("failed to discover controller manager: %w", err)
	}

	enabledControllers := make([]string, 0) // non-nil: distinguishes "0 controllers" from "ConfigMap not found"
	for _, serializedInfo := range configMap.Data {
		var info ControllerManagerInfo
		if err := json.Unmarshal([]byte(serializedInfo), &info); err != nil {
			return nil, fmt.Errorf("failed to parse controller manager info: %w", err)
		}

		enabledControllers = append(enabledControllers, info.EnabledControllers...)
	}

	return enabledControllers, nil
}

// IsControllerRunning checks if a specific controller is running in any of the
// shared controller managers.
func (d *ControllerManagerDiscovery) IsControllerRunning(ctx context.Context, controllerName string) (bool, *ControllerManagerInfo, error) {

Running=True is satisfied by LimaVM reaching Running — well before the engine controller has connected to Docker, run fullSync, and stamped ContainerEngineReady=True. That is exactly the race this PR is designed to close. engine.bats never catches it because its tests either start the service with rdd set running=true against an already-ready control plane or run after startup has fully settled.

Codex originally traced this as Important; Claude independently flagged the same ordering (as S1 "isEngineControllerEnabled queries discovery before controller managers register") and proposed defaulting to ContainerEngineReady when the controller list is empty-during-startup. We keep the Important classification because the failure mode reintroduces a ship-blocking behaviour from before the PR.

Fix options, in order of preference:

Move registerDiscovery before MarkControlPlaneReady in SharedControllerManager.Start. The enabled-controller list is then visible to every client that waits for the ready annotation. This is the minimal change and preserves the "ready = clients may use the control plane" contract.
If keeping readiness independent of controller-manager health is load-bearing, precompute the enabled-controller list into the discovery ConfigMap (or a sibling key) before MarkControlPlaneReady, and have isEngineControllerEnabled read that precomputed record.
As a defensive change on top, let isEngineControllerEnabled distinguish "empty list" from "registered with zero engine" and default to the stricter ContainerEngineReady wait when the list is empty. This alone is not sufficient — it papers over the ordering — but it is a belt-and-braces safeguard.

I2. Engine cleanup and full-sync prune delete every Container/Image/Volume in rancher-desktop, not only engine-owned mirrors Codex¶

pkg/controllers/app/engine/controllers/engine_reconciler.go

	}
	if err := r.deleteAllOfType(ctx, &containersv1alpha1.ImageList{}); err != nil {
		errs = append(errs, fmt.Errorf("failed to delete Images: %w", err))
	}
	if err := r.deleteAllOfType(ctx, &containersv1alpha1.ContainerNamespaceList{}); err != nil {
		errs = append(errs, fmt.Errorf("failed to delete ContainerNamespaces: %w", err))
	}
	return errors.Join(errs...)
}

// deleteAllOfType lists and removes every resource of the given kind
// in apiNamespace. Engine is authoritative for Container, Image, and
// Volume in `rancher-desktop`: this loop deletes every object, not
// just engine-owned mirrors. Coexistence with another writer to these
// kinds requires an engine-owned label and matching filters here and
// in the sync_*.go full-sync prune paths. Finalizer removal retries
// on conflict; per-item errors are collected so one stuck object does
// not block the rest.
func (r *EngineReconciler) deleteAllOfType(ctx context.Context, list client.ObjectList) error {
	if err := r.List(ctx, list, client.InNamespace(apiNamespace)); err != nil {
		return err
	}

	items, err := meta.ExtractList(list)
	if err != nil {
		return err
	}

	var errs []error
	for _, item := range items {
		obj := item.(client.Object)
		key := client.ObjectKeyFromObject(obj)
		// meta.ExtractList leaves each item's TypeMeta empty (GVK is
		// only on the list), so format the Go type with %T instead of
		// obj.GetObjectKind().
		retryErr := retry.RetryOnConflict(retry.DefaultRetry, func() error {
			latest := obj.DeepCopyObject().(client.Object)
			if err := r.Get(ctx, key, latest); err != nil {
				if apierrors.IsNotFound(err) {
					return nil
				}
				return err
			}
			if !controllerutil.RemoveFinalizer(latest, mirrorFinalizer) {
				return nil
			}
			return r.Update(ctx, latest)
		})
		if retryErr != nil {
			errs = append(errs, fmt.Errorf("failed to remove finalizer from %T %s: %w",
				obj, obj.GetName(), retryErr))
			continue
		}
		// Delete uses the stale list item, not latest from the retry

When cleanupMirrorResources runs (App stopped, backend switched, controller teardown), deleteAllOfType lists every object of a given kind in the rancher-desktop namespace. For each object, it enters a retry closure that removes the engine mirrorFinalizer if present — but the closure exits with nil on "finalizer not present" (line 319). The outer loop then falls through to r.Delete(ctx, obj) unconditionally at line 331. An object created by any other controller in rancher-desktop, lacking the engine finalizer, is deleted anyway.

The full-sync prune paths have the same shape:

// sync_containers.go:62-75
var containerMirrors containersv1alpha1.ContainerList
if err := w.k8s.List(ctx, &containerMirrors, client.InNamespace(apiNamespace)); err != nil {
    return fmt.Errorf("failed to list Containers: %w", err)
}
for i := range containerMirrors.Items {
    c := &containerMirrors.Items[i]
    if !dockerIDs[c.Name] {
        log.V(1).Info("Removing stale Container", "id", c.Name)
        if err := w.removeMirrorResource(ctx, c, c.Name); err != nil {
            errs = append(errs, err)
        }
    }
}

Any Container whose metadata.name does not correspond to a live Docker container ID is pruned, regardless of whether it was created by the engine controller. sync_images.go:220-239 and sync_volumes.go apply the same logic to their kinds. The fallout is already visible in this PR: bats/tests/34-containers-controllers/containers-mock.bats had to switch to --controllers=containers,-engine so that the mock container controller keeps exclusive ownership of the mirror kinds while its tests run. That is a test-only workaround for an architectural problem: two controllers in the same group both write authoritative state into the same namespace and the same kinds, with no ownership marker distinguishing their objects.

pkg/controllers/app/engine/controllers/sync_images.go


	freshNames, applyErr := w.applyImageFromInspect(ctx, result.InspectResponse)
	keep := make(map[string]bool, len(freshNames))
	for _, n := range freshNames {
		keep[n] = true
	}

	var images containersv1alpha1.ImageList
	if err := w.k8s.List(ctx, &images, client.InNamespace(apiNamespace)); err != nil {
		return errors.Join(applyErr, err)
	}
	var errs []error
	if applyErr != nil {
		errs = append(errs, applyErr)
	}
	for i := range images.Items {
		img := &images.Items[i]
		if img.Status.ID != result.InspectResponse.ID {
			continue
		}
		if keep[img.Name] {
			continue
		}
		if err := w.removeMirrorResource(ctx, img, img.Name); err != nil {
			errs = append(errs, err)
		}
	}
	return errors.Join(errs...)
}

Fix: stamp engine-owned objects with a distinguishing label or annotation (e.g. engine.rancherdesktop.io/owned-by=engine-controller) in addition to the finalizer, and filter both deleteAllOfType and every full-sync prune by that marker. Unmarked objects must be skipped entirely so other controllers — the mock used by containers-mock.bats today, any future containerd mirror, or a human-created test fixture — can coexist in the same namespace without losing state on a VM stop or backend switch. Deleting by marker also removes the need for the -engine carve-out in containers-mock.bats.

Two earlier rounds discussed the related "namespace self-healing" question and decided not to re-create rancher-desktop / containernamespace/moby after a user deletes them mid-run. That decision is distinct from this one: self-healing is about re-creating your own objects, while this finding is about not destroying someone else's.

I3. cmd/app-controller imports the engine package, but external.RunControllers does not filter by API group — the standalone binary can register engine under the wrong shared manager Codex¶

cmd/app-controller/main.go


import (
	"os"

	// Import app controller packages to trigger init() functions.
	_ "github.com/rancher-sandbox/rancher-desktop-daemon/pkg/controllers/app/app"
	_ "github.com/rancher-sandbox/rancher-desktop-daemon/pkg/controllers/app/demo"
	"github.com/rancher-sandbox/rancher-desktop-daemon/pkg/external"
)

func main() {
	os.Exit(external.RunControllers("app"))
}

cmd/app-controller/main.go now blank-imports engine to trigger its init(). The intent of round 3 was the opposite: engine declares itself in the containers API group (pkg/controllers/app/engine/controller.go:32-38) precisely so --controllers=app selections cannot pull it in without its CRDs.

pkg/controllers/app/engine/controller.go

		return
	}
	base.RegisterController(newController())
}

const (
	controllerName = "engine"
	// apiGroup is "containers" because the reconciler watches
	// Container, Image, and Volume from that group and needs
	// their CRDs at startup. Grouping engine with its dependencies
	// keeps --controllers selections from splitting the two apart.
	apiGroup = "containers"
)

type controller struct{}

func newController() base.Controller {

// pkg/external/main.go:62-72
for _, controller := range base.GetAllControllers() {
    shouldStart, err := shouldStartController(ctx, config, controller.GetName(), setupLog)
    // ...
    } else if shouldStart {
        controllersToStart = append(controllersToStart, controller)
    }
}

RunControllers iterates the global registry and filters only by shouldStartController, which checks whether the controller is already running in another manager. It never compares controller.GetAPIGroup() to the apiGroupName passed in at line 31. If the embedded control plane is running with engine disabled (--controllers=-engine), or if engine has not registered itself in discovery yet, shouldStartController returns true, and cmd/app-controller adds engine to its controllersToStart list. Engine is then registered into the shared manager named "app" at line 99 — not "containers" — which is exactly the scenario round 3 tried to prevent. Once in, engine's SetupWithManager watches Container/Image/Volume from the containers group; if those CRDs are not installed (they are not installed by app-controller itself — demo and app have no containers dependency), the watch caches fail to establish and the manager startup errors out.

Gemini filed a related finding but reversed the effect: it claimed external.RunControllers("app") filters by API group and would therefore silently omit engine. That would be a safer outcome than what the code actually does, but RunControllers does not filter. The engine controller is imported and registered under the wrong shared manager.

Fix options, in order of preference:

Drop the _ "...controllers/app/engine" import from cmd/app-controller/main.go. The blank import contributes nothing the standalone binary needs now that engine belongs to a different group, and the other two app-group controllers continue to work. This is a one-line change.
In pkg/external/main.go, add an if controller.GetAPIGroup() != apiGroupName { continue } guard to the controllers-to-start loop. This turns the filtering into a systemic guardrail: a future mistaken import in any standalone binary is prevented from cross-registering into the wrong manager. Pair with (1) for belt-and-braces.
If the intent is to eventually run engine from a containers-specific standalone binary (cmd/containers-controller), add that binary and move the engine import there instead.

Suggestions ¶

S1. context.WithoutCancel leaks the watcher goroutine on manager shutdown Claude¶

pkg/controllers/app/engine/controllers/engine_reconciler.go

	//
	// Run both passes unconditionally and join their errors. Early
	// return on specsErr would stall every mirror's finalizer behind
	// a single stuck spec.state actuation: the next reconcile would
	// hit the same broken container first and skip finalizers again.
	var specsErr, finErr error
	if err := r.reconcileContainerSpecs(ctx); err != nil {
		specsErr = fmt.Errorf("failed to reconcile container specs: %w", err)
	}
	if err := r.processFinalizers(ctx); err != nil {
		finErr = fmt.Errorf("failed to process finalizers: %w", err)
	}
	return ctrl.Result{}, errors.Join(specsErr, finErr)
}

// startWatcher creates and starts a Docker watcher goroutine. The
// watcher inherits watcherCtx, which cancels only on manager
// shutdown; startWatcher deliberately drops Reconcile's ctx so a
// future ReconciliationTimeout or per-request deadline cannot kill
// the watcher the moment Reconcile returns.
func (r *EngineReconciler) startWatcher(_ context.Context) error {
	r.watcherMu.Lock()
	defer r.watcherMu.Unlock()

	if r.watcher != nil {
		return nil
	}

	w, err := newDockerWatcher(r.watcherCtx, r.Client, r.reconcileChan)
	if err != nil {
		return err
	}

The round-3 fix for "watcher bound to Reconcile context" correctly prevents per-request cancellation from killing the watcher, but it also prevents manager-shutdown cancellation from reaching the watcher. The only shutdown path for the watcher is stopWatcher, which is called from Reconcile. On manager shutdown, controller-runtime cancels the manager context, Reconcile stops being invoked, and stopWatcher never runs. The watcher goroutine stays alive — and holds the Docker client connection open — until the process exits moments later.

Practical impact is small: GracefulShutdownTimeout on the manager is 10 s (manager.go:161), process exit follows quickly, and the watcher goroutine dies with the process. But cli.Close() never runs, the watcher's shutdown logs are missing, and any future refactor that wants ordered cleanup (flush pending events, fsync metadata) would have no hook to plug into. Claude rated this Important; we classify as Suggestion given the narrow shutdown-only impact, while noting it regresses the explicit intent of the round-3 fix.

pkg/service/controllers/manager.go

	if scm.hasWebhookControllers() {
		// Use instance TLS directory for webhook certificates (persistent storage)
		webhookCertDir := instance.TLSDir()

		opts := webhook.Options{
			Host:     "127.0.0.1",
			Port:     scm.webhookPort,
			CertDir:  webhookCertDir,
			CertName: fmt.Sprintf("webhook-%s.crt", scm.name),
			KeyName:  fmt.Sprintf("webhook-%s.key", scm.name),
		}

Fix: register a manager.RunnableFunc with the shared manager in RegisterWithManager that listens on the manager context and calls stopWatcher on cancel. Or store the watchCancel alongside watcherMu and cancel it from a mgr.Add(manager.RunnableFunc(...)) runnable. Either pattern gives the watcher a proper shutdown hook while keeping its lifetime independent of any individual Reconcile ctx.

S2. engine.bats has no local_teardown_file — failed tests leak Docker containers/volumes into the next run Claude¶

bats/tests/32-app-controllers/engine.bats

# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: SUSE LLC
# SPDX-FileCopyrightText: The Rancher Desktop Authors

load '../../helpers/load'

The new suite creates named Docker objects (test-lifecycle, test-state, test-delete, test-inuse, test-unpause, dangling-pin, dangling-tag-test:v1, test-vol, My_Vol_Ume) across many tests. A failure mid-suite leaves them running; subsequent runs error out with container name "test-lifecycle" is already in use. The rdd svc delete in local_setup_file destroys the VM and its Docker state, which is what catches this today — but only because the suite always performs a full VM recreate on every run. Dropping that rdd svc delete (an obvious future optimisation for test speed) would reintroduce the leaks immediately.

Fix: add a local_teardown_file that force-removes the named test objects:

local_teardown_file() {
    docker rm --force test-lifecycle test-state test-delete test-inuse \
        test-unpause dangling-pin 2>/dev/null || true
    docker rmi dangling-tag-test:v1 2>/dev/null || true
    docker volume rm test-vol My_Vol_Ume 2>/dev/null || true
}

The suite is currently self-cleaning only because of the VM-recreate in local_setup_file, which is a load-bearing side effect worth replacing with explicit teardown.

S3. errs := []error{applyErr} pre-seeds the slice with a possibly-nil error Claude¶

pkg/controllers/app/engine/controllers/sync_images.go


	var images containersv1alpha1.ImageList
	if err := w.k8s.List(ctx, &images, client.InNamespace(apiNamespace)); err != nil {
		return errors.Join(applyErr, err)
	}
	var errs []error
	if applyErr != nil {
		errs = append(errs, applyErr)
	}
	for i := range images.Items {
		img := &images.Items[i]

errors.Join at line 239 discards nils, so the behaviour is correct. The pattern reads as if every call returns at least one error, and diverges from the more common var errs []error; if applyErr != nil { errs = append(errs, applyErr) } shape used elsewhere in the same package. Pure style; change only if you are in the file anyway.

pkg/controllers/app/engine/controllers/sync_images.go

		}
		if keep[img.Name] {
			continue
		}
		if err := w.removeMirrorResource(ctx, img, img.Name); err != nil {
			errs = append(errs, err)
		}
	}
	return errors.Join(errs...)
}

S4. --controllers help text does not list the new containers group Codex¶

pkg/service/controllers/options.go

}

// CompletedOptions holds the completed controller configuration options.
type CompletedOptions struct {
	*completedOptions
}

// AddFlags adds the flags for the controller options to the given FlagSet.
func (o *Options) AddFlags(fs *pflag.FlagSet) {
	if o == nil {
		return

The round-3 fix moved engine from the app group to the containers group, but both help strings (options.go:32 and the equivalent line in cmd/rdd/service.go:174) still only document rdd and app. A user who reads --help has no indication that -engine, containers,-engine, or containers are valid selectors — and this PR introduces a containers-mock.bats test that already relies on the containers,-engine form.

pkg/service/cmd/options/options.go

	"k8s.io/client-go/util/keyutil"
	cliflag "k8s.io/component-base/cli/flag"
	"k8s.io/klog/v2"
	controlplaneapiserveroptions "k8s.io/kubernetes/pkg/controlplane/apiserver/options"
	"k8s.io/kubernetes/pkg/features"
	kubeoptions "k8s.io/kubernetes/pkg/kubeapiserver/options"
	"k8s.io/kubernetes/pkg/serviceaccount"

	"github.com/rancher-sandbox/rancher-desktop-daemon/pkg/instance"
	rddadmission "github.com/rancher-sandbox/rancher-desktop-daemon/pkg/service/admission"
	"github.com/rancher-sandbox/rancher-desktop-daemon/pkg/service/controllers"

cmd/rdd/service.go

	command := &cobra.Command{
		Use:  "create",
		Long: "Create RDD control plane",
		RunE: serviceCreateAction,
	}

	command.Flags().String("controllers", "*", controllers.ControllersFlagUsage)
	command.Flags().Int("secure-port", 0, "The port on which to serve HTTPS with authentication and authorization (default: 6443 + instance index)")
	if !developer.Mode() {
		_ = command.Flags().MarkHidden("controllers")
		_ = command.Flags().MarkHidden("secure-port")

Fix: update both help strings to list every registered API group, including containers and lima. A structural fix would derive the help text from the registered controllers at startup so it cannot drift again; a minimal fix is to just edit both strings and add a test that every selector used in bats/ appears in the help.

Design Observations ¶

Strengths ¶

retry.RetryOnConflict + re-Get on every multi-writer status write (in-scope) Claude Codex Gemini — AppReconciler mirroring LimaVM conditions and EngineReconciler.setEngineCondition both wrap their App-status updates in retry + re-Get. Controller-runtime serialises Reconcile within a controller but not across controllers, so this is the correct response to two reconcilers writing different conditions on the same object; the pattern is applied consistently and matches removeMirrorResource and deleteAllOfType.
Daemon-clock anchoring for Docker event replay (in-scope) Codex — dockerEventsSince uses Info.SystemTime rather than host time.Now() for the event replay filter, which avoids dropping events inside a host/guest clock-skew window. The biased host-clock fallback on Info failure is a sensible safety net.
Dual SSA field managers for spec vs finalizer apply (in-scope) Claude — engine-controller owns spec; engine-controller-finalizer owns the finalizer-only apply. A finalizer-only SSA through the main field owner would release ownership of spec and fail the CRD's required-field validation. The pattern is non-obvious enough that the code comment documenting it is the right move.
Stable sort before atomic-list apply (in-scope) Gemini — RepoDigests, ContainerPort bindings, and port names are sorted before every apply, so Go's randomised map iteration does not mint a new resourceVersion on each sync. Matches the atomic-list SSA semantics the CRD declares.
Idempotent fullSync preserving mirror identities across watcher reconnects (in-scope) Claude — Container IDs, image hashes, and volume hashes are stable across reconnects, so transient Docker disconnects cause no downstream churn. The combination with the dead-watcher detection + reconnect path handles daemon restarts cleanly.

Concerns ¶

Engine controller is a namespace-scanner, not an owned-object reconciler (in-scope) Codex — reconcileContainerSpecs, processFinalizers, cleanupMirrorResources, and all three full-sync prune paths list every object of a kind in rancher-desktop and act on them regardless of ownership. Covered in I2 as the immediate concrete bug; the deeper concern is architectural. The code comment at engine_reconciler.go:181-185 names per-resource reconcilers with watch predicates as the long-term fix, and pairing that with an ownership label would eliminate both the O(N)-per-event cost and the cross-controller interference in one change.
Watcher shutdown hook missing from the context.WithoutCancel rewrite (in-scope) Claude — Covered as S1. The round-3 fix decoupled the watcher from individual Reconcile calls but left no manager-level cancellation path. A manager.RunnableFunc registered in RegisterWithManager would complete the lifecycle.
Per-reconcile List cost for specs and finalizers (future) Claude — Already flagged in prior rounds and acknowledged in code comments. Becomes load-bearing once mirror counts exceed ~100 per type; fine for dev workstations today.

Testing Assessment ¶

BATS coverage for the new functionality is solid: engine.bats exercises image pull/untag/dangling, container lifecycle, volume mirroring (including RFC-1123-invalid names), bidirectional deletion, spec.state transitions, the non-moby NotApplicable path, and rdd set wait semantics for already-stopped / --timeout=0. The new sync_containers_test.go unit test covers parseContainerName, and discovery_test.go covers the InitDiscovery/MarkControlPlaneReady lifecycle.

Untested scenarios surfaced by the findings above:

Cold-start race on rdd set running=true (I1). engine.bats starts the service against an already-ready control plane or lets startup fully settle before asserting. A test that patches App the moment ReadyAnnotation lands — before registerDiscovery completes — would catch the ordering regression. Given the tight window, a unit-level test on waitForDesiredState against a mocked discovery ConfigMap is more practical.
Non-engine resources in rancher-desktop (I2). No test creates a Container/Image/Volume without the engine finalizer and verifies that engine stop / full-sync leaves it alone. The containers-mock.bats suite has already been modified to exclude engine, which documents the problem instead of guarding against it.
Standalone app-controller with the engine import (I3). No test runs cmd/app-controller with the embedded control plane disabled or with engine explicitly missing, so the misregistration path is unguarded.
Watcher survives manager shutdown without cli.Close (S1). The missing shutdown hook is invisible at the unit-test level because the process exits anyway; an integration test that inspects /proc/self/fd before and after shutdown on Linux would surface the leaked Docker client connection.
engine.bats teardown (S2). A forced-failure test that asserts the next run starts cleanly would catch the leak.

None of the reviewers ran go test this round; all analysis is static. The round-3 review ran go test ./pkg/controllers/app/engine/controllers ./pkg/service/controllers ./pkg/controllers/containers/container ./pkg/controllers/app/app/controllers against a prior revision and all four packages passed; re-running on a5235c8 would be worthwhile before merge.

Documentation Assessment ¶

Design docs remain consistent with the implementation: api_app.md (new ContainerEngineReady condition), api_containers.md (engine mirroring + spec.state + finalizer lifecycle), cmd_app.md (--timeout and wait semantics, with the --controllers=-engine fallback caveat added in round 3), and discovery.md (ready annotation).

Two gaps remain after this round:

--controllers help text in pkg/service/controllers/options.go:32 and cmd/rdd/service.go:174 does not mention the containers group — tracked as S4.
docs/design/discovery.md describes the ready annotation contract but does not mention that EnabledControllers is populated after the annotation. Documenting that ordering would have helped this round catch I1 earlier, and clients that depend on the list should know to wait past just the annotation.

Acknowledged Limitations ¶

cmd/rdd/set.go:469-472 — Changes that trigger a VM restart without changing running (for example containerEngine.name alone) are still not waited on. The TODO proposes a Reconciled condition as the long-term fix. Unchanged from prior rounds.
pkg/controllers/app/engine/controllers/engine_reconciler.go:181-185 — reconcileContainerSpecs and processFinalizers are O(N) per child event. Per-resource reconcilers with watch predicates are the named long-term fix. Interacts with I2 — the ownership-label change would compose well with the split.
pkg/controllers/app/engine/controllers/docker_watcher.go:179-185 — Transient handleEvent errors for image/volume create are logged and dropped; the mirror stays missing until the next full resync. Periodic fullSync tick is deferred.
pkg/controllers/app/engine/controllers/sync_images.go:243-249 — removeImagesByID matches on status.id; a narrow orphan window exists if metadata Apply succeeded but status Apply failed transiently. Self-heals on the next watcher restart.
pkg/controllers/app/engine/controller.go:23-29 — Engine controller registration skipped on Windows until WSL2 socket access is wired up. The round-3 fix to waitForDesiredState means rdd set running=true no longer hangs on Windows, but the mirroring functionality itself is still unavailable there.
pkg/controllers/app/engine/controllers/engine_reconciler.go:136-147 — ContainerEngineReady is moby-only; to be renamed when containerd mirroring lands.

Unresolved Feedback ¶

BATS $(rdd svc paths docker_socket) at file scope (bats/tests/32-app-controllers/engine.bats:13-14). Flagged by Claude and Gemini this round, and as S2 in round 3 — declined in all prior rounds on the grounds that rdd svc paths docker_socket is a pure computation with no real failure mode. Recording here rather than in Suggestions because the author's position is on the record and unchanged. If the decision is that the project's BATS style rule applies only when the substituted command has a real failure path (socket probes, process queries), that carve-out would be worth documenting in bats-style.md so reviewers and agents stop re-flagging it.

Agent Performance Retro ¶

Claude ¶

Claude caught the cold-start race on rdd set running=true and filed it as S1, with a correct root-cause description but insufficient urgency — the fallback from ContainerEngineReady to Running=True is exactly the regression the PR exists to close. Codex filed the same code path at Important with a tighter file:line trace, and we promoted Claude's S1 into the consolidated I1. Claude's unique contribution was the context.WithoutCancel shutdown leak, a clean catch on a round-3 fix: it identified that the decoupling is too complete and that no manager-level cancellation path exists. Claude withdrew one self-filed finding (I3, stale-discovery cleanup) mid-review after re-reading the code, which is the right behaviour and worth rewarding. Its Coverage Summary was the most complete of the three, flagging per-file "Trivial" vs "Reviewed, no issues". Calibration was mostly sound though Claude's I1 (WithoutCancel) reads closer to Suggestion-severity; we downgraded.

Codex ¶

Codex delivered the three highest-impact structural findings of the round and the only Important findings that survived consolidation unchanged. The ReadyAnnotation-vs-registerDiscovery ordering trace crosses four files (manager.go, set.go, discovery.go, service.go) and gets the causality right on the first try. The "engine deletes unowned objects" finding correctly links the code behaviour to the visible test-suite carve-out in containers-mock.bats, which is the kind of cross-file reasoning that separates a ship-blocking finding from a theoretical one. The app-controller import finding contradicts Gemini on the same code path and is right. Codex did not find the shutdown leak Claude caught. It did not run tests this round (prior rounds did). Calibration was correct on all three Important findings — no reconciliation adjustments needed.

Gemini ¶

Gemini's output this round was net-negative. Its Critical finding (C1, invalid @test syntax) is a hallucination: the engine.bats file uses @test correctly, verified directly on disk. Its Important I1 (API group mismatch with app-controller silently omitting engine) is backwards — pkg/external/main.go does not filter by GetAPIGroup(), and Codex's finding on the same code path has the causality right. Its Important I2 (BATS $(...) substitution) is a re-flag of a standing-declined style issue. We dropped C1 and I1 as false positives and routed I2 to Unresolved Feedback. Gemini skipped git blame again (consistent with its quota-driven behaviour), so it could not distinguish regressions from pre-existing issues — which might be why it spent its analysis budget on hallucination-prone structural claims rather than on tracing the round-3 fixes. This is the second round where Gemini's calibration has been notably worse than the other two agents; worth watching whether the pattern is model-level or this-diff-specific.

Summary ¶

	Claude Opus 4.6	Codex GPT 5.4	Gemini 3.1 Pro
Duration	3m 34s	7m 26s	6m 54s
Findings	3S	3I 1S	none
Tool calls	35 (Read 19, Grep 9, Bash 7)	78 (shell 77, plan 1)	21 (grepsearch 12, runshellcommand 8, readfile 1)
Design observations	5	4	3
False positives	1	0	2
Unique insights	1	3	0
Files reviewed	39	39	39
Coverage misses	0	0	0
Totals	3S	3I 1S	none
Downgraded	1 (I→S)	0	0
Dropped	3	0	3

Reconciliation — Claude P1 isEngineControllerEnabled race: suggestion → important I1 (promoted; shared with Codex and classified as a regression-reintroducing bug rather than a soft concern). Claude P1 context.WithoutCancel shutdown leak: important → suggestion S1 (downgraded; narrow shutdown-only impact and process exits soon after). Gemini P1 invalid @test syntax: critical → dropped as false positive (file verified to use @test correctly). Gemini P1 API-group mismatch: important → dropped as false positive (the causality trace contradicts pkg/external/main.go, which does not filter by GetAPIGroup()). Gemini P1 BATS $(...) substitution: important → dropped to Unresolved Feedback (standing-declined in prior rounds). Claude P1 image-name length (S3 in its review): dropped as false positive (DNS-1123 subdomain, 253-char limit applies). Claude P1 waitForDesiredState non-running TODO: dropped (acknowledged limitation, not a finding). Claude P1 stale-discovery cleanup: withdrawn by Claude itself during review.

Overall, Codex carried the round. Claude contributed one valuable unique finding and corroborated the highest-impact regression; Gemini did not usefully improve the output and cost orchestration time on verification. For a round-4 review of a bundle of targeted fixes, two reliable agents is the right spread — Gemini's hallucinations this round are a reminder that agent output must be verified against source before it reaches a report.

Review Process Notes ¶

Skill improvements ¶

When reviewing a follow-up round, treat each prior-round fix as a dedicated review target. Every fix is a small patch against known-good code, so the reviewer should explicitly ask: "what new code path does this introduce, and what else reads the state this change writes?" Three of the four round-4 findings (I1, I3, S1) are direct consequences of round-3 fixes interacting with unrelated code paths. A review checklist item of the form "for each prior-round fix, list every caller of the modified symbol and verify none depend on the prior timing or shape" would have caught I1 (the new isEngineControllerEnabled caller vs the existing registerDiscovery ordering) on the first pass rather than promoting a suggestion into an important.

Repo context updates ¶

UPDATE deep-review-context.md: add the discovery ConfigMap ordering contract. The file already documents that CRDs become ready when rdd.rancherdesktop.io/ready=true appears on the discovery ConfigMap. Round 4 surfaced a subtler invariant: EnabledControllers in configMap.Data is populated after the ready annotation, because registerDiscovery runs later in SharedControllerManager.Start. Any client that reads both the annotation and the enabled-controller list races. Add a one-paragraph note in the "Control plane readiness and discovery" subsection stating that the annotation and the enabled-controller list are not consistent checkpoints, and that code depending on both must serialize through something stricter (e.g. a controller-manager-health probe or a precomputed enabled-controller key written before MarkControlPlaneReady). Rationale: I1 this round could have been prevented by a repo-context entry that a reviewer would have read before pattern-matching on "read discovery, make decision".
ADD to deep-review-context.md: engine controller deletes unowned objects in rancher-desktop namespace. Until I2 is fixed, any reviewer looking at a PR that adds a new controller touching Container/Image/Volume kinds needs to know that engine will delete those objects on VM stop/backend switch unless they carry the engine finalizer. Add a short warning in the "Engine controller" subsection saying that Container/Image/Volume mirrors in rancher-desktop are co-owned space and a second controller must either pick a different namespace or coordinate with engine's prune logic. Drop this entry once I2 is resolved with an ownership marker.