Deep Review: 20260415-103429-pr-309

Date	2026-04-15 10:34
Repo	rancher-sandbox/rancher-desktop-daemon
Round	5 (of pr-309)
Author	@jandubois
PR	#309 — engine: mirror Docker container engine state into Kubernetes
Branch	`engine-v3`
Commits	`f3856c1` engine: mirror Docker container engine state into Kubernetes
Reviewers	Claude Opus 4.6, Codex GPT 5.4, Gemini 3.1 Pro
Verdict	Merge with fixes — two real regressions: `containernamespace/moby` is not self-healing after user deletion, and the external `containers-controller` binary never registers the engine controller. Two narrower races around the discovery ready annotation are worth tightening before merge.
Wall-clock time	`13 min 40 s`

Executive Summary ¶

The PR adds the engine controller under pkg/controllers/app/engine/, plus rdd set wait semantics, a discovery ready annotation, and a new Container.spec.state=unknown value. The mirroring design (SSA dual field manager, daemon-clock-anchored event replay, retry.RetryOnConflict across every App status writer, idempotent fullSync on reconnect) is sound and well-documented. The suite finds no critical issues.

Four important findings warrant pre-merge attention: (1) ContainerNamespace moby is only applied during fullSync, so a user delete is never restored until the watcher cycles — a BATS test even locks in the broken behavior with refute_output; (2) the engine controller's blank import is only in the embedded rdd svc serve path, not in cmd/containers-controller/main.go, so the external binary silently drops the feature; (3) MarkControlPlaneReady fires before mgr.Start(), creating a narrow window where a client observes the annotation but webhooks and reconcilers have not yet started; (4) a failed registerDiscovery is logged-not-fatal but the ready annotation is still written over an empty controller list, which flips isEngineControllerEnabled to the wrong wait branch in rdd set.

Critical Issues ¶

None.

Important Issues ¶

I1. ContainerNamespace moby is not recreated after user deletion Codex¶

pkg/controllers/app/engine/controllers/engine_reconciler.go

		func(_ context.Context, _ client.Object) []reconcile.Request {
			return []reconcile.Request{{NamespacedName: types.NamespacedName{Name: appName}}}
		},
	)

	return ctrl.NewControllerManagedBy(mgr).
		For(&appv1alpha1.App{}).
		Named("engine-reconciler").
		WatchesRawSource(source.Channel(r.reconcileChan, enqueueApp)).
		Watches(&containersv1alpha1.Container{}, enqueueApp).
		Watches(&containersv1alpha1.Image{}, enqueueApp).
		Watches(&containersv1alpha1.Volume{}, enqueueApp).
		Complete(r)
}

syncContainerNamespace is only called from fullSync (docker_watcher.go:436-438), and the steady-state reconcile path in Reconcile never touches the ContainerNamespace mirror. The controller graph also does not Watches it, so deleting containernamespace/moby fires no event, triggers no reconcile, and the mirror stays gone until the watcher cycles (Docker daemon restart or VM restart). The API stops reflecting the running engine's namespace in the meantime.

pkg/controllers/app/engine/controllers/docker_watcher.go

		return fmt.Errorf("failed to ensure namespace: %w", err)
	}

	var errs []error

	if err := w.syncContainerNamespace(ctx); err != nil {
		errs = append(errs, fmt.Errorf("failed to sync container namespace: %w", err))
	}
	if err := w.syncAllContainers(ctx); err != nil {
		errs = append(errs, fmt.Errorf("failed to sync containers: %w", err))
	}
	if err := w.syncAllImages(ctx); err != nil {
		errs = append(errs, fmt.Errorf("failed to sync images: %w", err))

The new BATS suite actually locks this behavior in: engine.bats:375-381 deletes containernamespace/moby and then asserts refute_output on the list, which passes only because the mirror never comes back.

bats/tests/32-app-controllers/engine.bats

    rdd set running=true
    rdd ctl wait --for=create --namespace="${NAMESPACE}" \
        containernamespace/moby --timeout=10s
}

@test "deleting containernamespace/moby completes without a finalizer hang" {
    # moby ContainerNamespace has no mirror finalizer, so a user delete
    # must return promptly rather than get trapped in Terminating.
    rdd ctl delete containernamespace/moby --namespace="${NAMESPACE}" --timeout=10s
    run -0 rdd ctl get containernamespaces --namespace="${NAMESPACE}" --output=name
    refute_output
}

# --- containerd backend ---

@test "containerd backend reports ContainerEngineReady=NotApplicable and skips mirroring" {
    # Stop first so there is no stale True/Connected from moby to

@test "deleting containernamespace/moby completes without a finalizer hang" {
    rdd ctl delete containernamespace/moby --namespace="${NAMESPACE}" --timeout=10s
    run -0 rdd ctl get containernamespaces --namespace="${NAMESPACE}" --output=name
    refute_output
}

Fix: add ContainerNamespace to the controller's Watches list, and re-apply containernamespace/moby as part of the steady-state reconcile (or gate it behind wantWatcher the same way reconcileContainerSpecs is). Flip the BATS test to rdd ctl wait --for=create containernamespace/moby --timeout=10s so the self-heal is verified instead of masked.

I2. External containers-controller binary does not register the engine controller Codex¶

cmd/containers-controller/main.go

// SPDX-FileCopyrightText: The Rancher Desktop Authors

// Command containers-controller implements the containers API.
package main

import (
	"os"

	// Import app controller packages to trigger init() functions.
	_ "github.com/rancher-sandbox/rancher-desktop-daemon/pkg/controllers/containers"
	"github.com/rancher-sandbox/rancher-desktop-daemon/pkg/external"
)

func main() {
	os.Exit(external.RunControllers("containers"))
}

The engine controller declares itself part of the containers API group (pkg/controllers/app/engine/controller.go:38) with the explicit rationale that it watches Container, Image, and Volume. The embedded path in pkg/service/cmd/service.go:59 blank-imports pkg/controllers/app/engine, but cmd/containers-controller/main.go only imports pkg/controllers/containers, so the external binary never sees the engine controller's init() and base.RegisterController never runs. external.RunControllers("containers") therefore exposes the containers CRDs without any Docker mirroring.

pkg/controllers/app/engine/controller.go

	controllerName = "engine"
	// apiGroup is "containers" because the reconciler watches
	// Container, Image, and Volume from that group and needs
	// their CRDs at startup. Grouping engine with its dependencies
	// keeps --controllers selections from splitting the two apart.
	apiGroup = "containers"
)

type controller struct{}

func newController() base.Controller {

pkg/service/cmd/service.go


	"github.com/rancher-sandbox/rancher-desktop-daemon/pkg/cli/help"
	// Import controller packages to trigger init() functions for embedded mode.
	_ "github.com/rancher-sandbox/rancher-desktop-daemon/pkg/controllers/app/app"
	_ "github.com/rancher-sandbox/rancher-desktop-daemon/pkg/controllers/app/demo"
	_ "github.com/rancher-sandbox/rancher-desktop-daemon/pkg/controllers/app/engine"
	"github.com/rancher-sandbox/rancher-desktop-daemon/pkg/controllers/base"
	// Import built-in system controllers.
	_ "github.com/rancher-sandbox/rancher-desktop-daemon/pkg/controllers/builtin/namespace"
	// Import controller packages to trigger init() functions for embedded mode.
	_ "github.com/rancher-sandbox/rancher-desktop-daemon/pkg/controllers/containers"

Either the grouping is right and the import is missing, or the grouping should change to app and the import should move to cmd/app-controller/main.go. Picking one and making the import graph match is the fix.

 import (
     "os"

-    _ "github.com/rancher-sandbox/rancher-desktop-daemon/pkg/controllers/containers"
+    _ "github.com/rancher-sandbox/rancher-desktop-daemon/pkg/controllers/app/engine"
+    _ "github.com/rancher-sandbox/rancher-desktop-daemon/pkg/controllers/containers"
     "github.com/rancher-sandbox/rancher-desktop-daemon/pkg/external"
 )

I3. MarkControlPlaneReady fires before mgr.Start() registers controllers and webhooks Claude¶

pkg/service/controllers/manager.go

	if err := scm.registerDiscovery(ctx); err != nil {
		log.Error(err, "Failed to register controller manager for discovery")
		// Don't fail startup for discovery registration errors
	}

	// Mark the control plane ready after registerDiscovery so clients
	// waiting on the ready annotation see both CRDs installed and
	// controller registration written, not just CRDs.
	if err := MarkControlPlaneReady(ctx, scm.discovery.client); err != nil {
		return fmt.Errorf("failed to mark control plane as ready: %w", err)
	}

	// Ensure cleanup on shutdown with a timeout to avoid blocking if apiserver is gone
	defer func() {
		cleanupCtx, cleanupCancel := context.WithTimeout(context.Background(), 5*time.Second)
		defer cleanupCancel()

MarkControlPlaneReady at line 201 runs well before mgr.Start(ctx) at line 284. Between those two points, Start still iterates scm.registrations to call RegisterWithManager (line 223-235) and then runs webhook Setup in parallel goroutines (line 241-261). mgr.Start is where controller-runtime actually starts the reconcile loops and the webhook HTTP server. A client that observes the ready annotation at this point and immediately issues a request can:

pkg/service/controllers/manager.go


	// Register all controllers before launching webhook goroutines.
	// RegisterWithManager calls AddToScheme, which writes to the scheme map.
	// Webhook setup reads the scheme via client.Apply. Running both concurrently
	// causes a fatal "concurrent map read and map write" crash.
	for _, registration := range scm.registrations {
		klog.InfoS("Registering controller with shared manager", "controller", registration.GetName())

		// If the controller needs the webhook port, provide it
		if webhookController, ok := registration.(base.WebhookController); ok {
			klog.InfoS("Setting webhook port for controller", "controller", registration.GetName(), "port", scm.webhookPort)
			webhookController.SetWebhookPort(scm.webhookPort)
		}

		if err := registration.RegisterWithManager(mgr); err != nil {
			return fmt.Errorf("failed to register controller %s: %w", registration.GetName(), err)
		}
	}

	// Now that all scheme mutations are done, set up webhooks in parallel.
	var webhookWaitGroup sync.WaitGroup
	webhookErrors := make(chan error, len(scm.registrations))

	for _, registration := range scm.registrations {
		if webhookController, ok := registration.(base.WebhookController); ok {
			managers := webhookController.GetWebhookManagers()
			for i, manager := range managers {
				if manager == nil {
					continue
				}
				name := registration.GetName()
				webhookWaitGroup.Add(1)
				go func() {
					defer webhookWaitGroup.Done()
					klog.Infof("Starting webhook setup for controller %s (webhook %d)", name, i)
					if err := manager.Setup(); err != nil {
						webhookErrors <- fmt.Errorf("controller %s webhook %d: %w", name, i, err)
					}
				}()
			}
		}
	}

	webhookWaitGroup.Wait()
	close(webhookErrors)

	var errs []error
	for err := range webhookErrors {
		errs = append(errs, err)

pkg/service/controllers/manager.go

		"controllers", len(scm.registrations),
		"metricsPort", scm.metricsPort,
		"healthPort", scm.healthPort)

	// Start the manager (this blocks until context is cancelled)
	if err := mgr.Start(ctx); err != nil {
		return fmt.Errorf("failed to start shared controller manager: %w", err)
	}

	return nil
}

hit an admission webhook whose HTTPS server is not yet listening, returning "connection refused" from the apiserver's webhook call, or
land on a reconciler that has not registered yet and see a spec change that is never reconciled until the next event fires.

The window is short in practice because rdd set's own ensureServiceRunning + waitForFreshDiscoveryConfigMap gives the manager time to finish startup, but the annotation's documented contract ("every enabled controller has installed its CRDs") is stronger than what the code currently guarantees, and the whole point of the annotation is to let clients skip ad-hoc waits.

Fix: move MarkControlPlaneReady into a manager.RunnableFunc added to mgr so it runs only after Start has initialized controllers and webhooks. A RunnableFunc added before mgr.Start fires once the manager is up.

I4. registerDiscovery failure is swallowed but MarkControlPlaneReady still runs Claude¶

pkg/service/controllers/manager.go

		return fmt.Errorf("failed to set up ready check: %w", err)
	}

	// Register controller manager in cluster for service discovery before
	// running the controllers.
	if err := scm.registerDiscovery(ctx); err != nil {
		log.Error(err, "Failed to register controller manager for discovery")
		// Don't fail startup for discovery registration errors
	}

	// Mark the control plane ready after registerDiscovery so clients
	// waiting on the ready annotation see both CRDs installed and
	// controller registration written, not just CRDs.
	if err := MarkControlPlaneReady(ctx, scm.discovery.client); err != nil {
		return fmt.Errorf("failed to mark control plane as ready: %w", err)
	}

	// Ensure cleanup on shutdown with a timeout to avoid blocking if apiserver is gone
	defer func() {
		cleanupCtx, cleanupCancel := context.WithTimeout(context.Background(), 5*time.Second)
		defer cleanupCancel()

If registerDiscovery fails at 193, execution falls through and MarkControlPlaneReady stamps the ready annotation on a discovery ConfigMap that has no controller entries. rdd set's isEngineControllerEnabled (cmd/rdd/set.go:523-537) then reads an empty controller list, decides the engine controller is not enabled, and falls back to waiting on Running=True — which returns before the Docker full sync has finished. The user sees rdd set running=true complete while the engine is still catching up.

cmd/rdd/set.go

	})
}

// isEngineControllerEnabled reports whether the engine controller
// is listed in the discovery ConfigMap's EnabledControllers.
func isEngineControllerEnabled(ctx context.Context, config *rest.Config) (bool, error) {
	discovery, err := controllers.NewControllerManagerDiscovery(config)
	if err != nil {
		return false, fmt.Errorf("failed to create discovery client: %w", err)
	}
	enabled, err := discovery.GetEnabledControllers(ctx)
	if err != nil {
		return false, fmt.Errorf("failed to list enabled controllers: %w", err)
	}
	for _, name := range enabled {
		if name == engineControllerName {
			return true, nil
		}
	}
	return false, nil
}

// watchCondition watches the App singleton until the predicate
// returns true or the context expires. watchtools.UntilWithSync
// re-lists transparently on benign watch-channel closures (API

Fix: either make registerDiscovery failure fatal (it already returns an error), or skip MarkControlPlaneReady when it fails. The "don't fail startup" comment pre-dates the new ready-annotation contract and is now at odds with it.

-	if err := scm.registerDiscovery(ctx); err != nil {
-		log.Error(err, "Failed to register controller manager for discovery")
-		// Don't fail startup for discovery registration errors
-	}
-
-	if err := MarkControlPlaneReady(ctx, scm.discovery.client); err != nil {
+	if err := scm.registerDiscovery(ctx); err != nil {
+		return fmt.Errorf("failed to register controller manager for discovery: %w", err)
+	}
+	if err := MarkControlPlaneReady(ctx, scm.discovery.client); err != nil {

Suggestions ¶

S1. BATS DOCKER_HOST is resolved at file-load time via bare command substitution Claude Codex¶

bats/tests/32-app-controllers/engine.bats

# Engine controller tests — verify that the engine controller mirrors Docker
# containers, images, and volumes into Container, Image, and Volume
# resources, and that deletions and spec.state changes are forwarded
# to Docker.

NAMESPACE="rancher-desktop"
DOCKER_HOST="unix://$(rdd svc paths docker_socket)"
export DOCKER_HOST

local_setup_file() {
    # The Docker socket access pattern used by these tests is not yet wired
    # up for Windows/WSL2.
    skip_on_windows

The assignment on line 13 runs during BATS file load, before local_setup_file has a chance to run rdd svc delete / rdd set running=true. It works today because rdd svc paths docker_socket is a static path computation, but (a) it bypasses the repo's BATS convention of capturing command output via run -0 so failures surface the real stderr instead of an opaque shell-load error, and (b) it couples the test to rdd svc paths never needing a running service, which is not promised by the command's contract.

Fix: move the resolution into local_setup_file after the service is up, using run -0 per the BATS style rules.

 NAMESPACE="rancher-desktop"
-DOCKER_HOST="unix://$(rdd svc paths docker_socket)"
-export DOCKER_HOST

 local_setup_file() {
     skip_on_windows
     rdd svc delete
     rdd set running=true
+    run -0 rdd svc paths docker_socket
+    DOCKER_HOST="unix://${output}"
+    export DOCKER_HOST
 }

S2. setCondition in the generic Container reconciler omits ObservedGeneration Gemini¶

pkg/controllers/containers/container/container_reconciler.go


	return ctrl.Result{}, nil
}

// setCondition sets or updates a condition in the container status and
// logs every state change.
func (r *reconciler) setCondition(ctx context.Context, container *containersv1alpha1.Container, conditionType string, status metav1.ConditionStatus, reason, message string) {
	changed := apimeta.SetStatusCondition(&container.Status.Conditions, metav1.Condition{
		Type:    conditionType,
		Status:  status,
		Reason:  reason,
		Message: message,
	})
	if !changed {
		return
	}
	logf.FromContext(ctx).Info("Condition changed",
		"type", conditionType, "status", status, "reason", reason, "message", message)

The engine controller and rdd set rely on observedGeneration >= post-write generation to tell a stale condition from a fresh one, and EngineReconciler.setEngineCondition / AppReconciler both stamp it. The generic Container reconciler's setCondition helper does not, so any future tooling that polls Container conditions — or any additional condition the generic controller grows — will silently lack the staleness signal. Cheap to add now; expensive to retrofit once a consumer depends on it.

Fix: pass container.Generation into the metav1.Condition literal.

S3. engineControllerName duplicates an unexported constant in the engine package Claude¶

cmd/rdd/set.go

const (
	appCRDName = "apps.app.rancherdesktop.io"

	// engineControllerName mirrors the unexported controllerName in
	// pkg/controllers/app/engine/controller.go.
	engineControllerName = "engine"
)

var appGVR = schema.GroupVersionResource{
	Group:    appv1alpha1.GroupVersion.Group,
	Version:  appv1alpha1.GroupVersion.Version,

isEngineControllerEnabled compares the discovery ConfigMap's controller list against this local copy. If either side drifts, the function silently falls back to the wrong rdd set wait path (Running=True instead of ContainerEngineReady=True) — a "successful" command that returns before mirroring is ready. Export the constant from pkg/controllers/app/engine/controller.go so drift is a compile error.

S4. startWatcher runs fullSync under watcherCtx, not Reconcile's ctx Claude¶

pkg/controllers/app/engine/controllers/engine_reconciler.go

// startWatcher creates and starts a Docker watcher goroutine. The
// watcher inherits watcherCtx, which cancels only on manager
// shutdown; startWatcher deliberately drops Reconcile's ctx so a
// future ReconciliationTimeout or per-request deadline cannot kill
// the watcher the moment Reconcile returns.
func (r *EngineReconciler) startWatcher(_ context.Context) error {
	r.watcherMu.Lock()
	defer r.watcherMu.Unlock()

	if r.watcher != nil {
		return nil
	}

	w, err := newDockerWatcher(r.watcherCtx, r.Client, r.reconcileChan)
	if err != nil {
		return err
	}
	r.watcher = w
	return nil
}

// stopWatcher stops the Docker watcher goroutine and waits for it to finish.
func (r *EngineReconciler) stopWatcher() {
	r.watcherMu.Lock()

The _ context.Context parameter makes the drop explicit and the comment at the callsite explains why, but newDockerWatcher runs fullSync synchronously before returning. A future ReconciliationTimeout on the engine controller would not apply to fullSync because watcherCtx only cancels on manager shutdown. A short comment on startWatcher noting that fullSync intentionally runs under watcherCtx, and that adding a reconciliation timeout would require decoupling fullSync from the startWatcher call, would save a reader the trace.

S5. reconcileContainerSpecs and processFinalizers read a stale watcher snapshot Claude¶

pkg/controllers/app/engine/controllers/engine_reconciler.go

// calling Docker start/stop. Per-container errors are joined and
// returned so controller-runtime requeues with backoff; without the
// return, a failed start/stop would get exactly one attempt and then
// sit forever.
func (r *EngineReconciler) reconcileContainerSpecs(ctx context.Context) error {
	r.watcherMu.Lock()
	w := r.watcher
	r.watcherMu.Unlock()
	if w == nil {
		return nil
	}

	var containers containersv1alpha1.ContainerList
	if err := r.List(ctx, &containers, client.InNamespace(apiNamespace)); err != nil {
		return err
	}

Both methods snapshot r.watcher under the lock and then use it without resynchronising. If the watcher dies between the snapshot and the Docker API calls, the Docker client is already closed (w.stop() calls cli.Close()), so the first attempt fails with use of closed network connection instead of a clearer "watcher died" message. The reconcile self-heals on requeue, so the user impact is only a noisier log. Either accept it with a one-line comment, or re-check w.alive() at the top of each method.

S6. Removing UnregisterControllerManager from the preamble loses a hint for future readers Claude¶

The diff drops a stale-cleanup call that used to run at the top of Start(). The reason is sound — InitDiscovery deletes and recreates the ConfigMap, so the preamble unregister is redundant — but neither the commit message nor the code says so. A one-line comment at the deferred cleanup block (line 206-212) noting that stale entries are handled by InitDiscovery would prevent a future reader from reintroducing the call.

pkg/service/controllers/manager.go

	if err := MarkControlPlaneReady(ctx, scm.discovery.client); err != nil {
		return fmt.Errorf("failed to mark control plane as ready: %w", err)
	}

	// Ensure cleanup on shutdown with a timeout to avoid blocking if apiserver is gone
	defer func() {
		cleanupCtx, cleanupCancel := context.WithTimeout(context.Background(), 5*time.Second)
		defer cleanupCancel()
		if err := scm.discovery.UnregisterControllerManager(cleanupCtx); err != nil {
			log.Error(err, "Failed to unregister controller manager")
		}
	}()

	// Add pass through server
	if err := mgr.Add(manager.RunnableFunc(scm.runPassthroughServer)); err != nil {
		return fmt.Errorf("failed to add pass through server to manager: %w", err)
	}

Design Observations ¶

Concerns

One-shot fullSync vs steady-state reconcile split (in-scope) Codex — The engine controller divides work between a one-time fullSync path that seeds scaffolding (e.g. ContainerNamespace, the rancher-desktop namespace) and a steady-state reconcile that only maintains Container/Image/Volume. I1 is the first consequence of that split: scaffolding resources exist only while startup or reconnect happens to recreate them. An explicit "ensure mirror scaffolding" step in steady-state reconcile is a small in-scope fix; the longer-term direction is the per-resource reconcilers with watch predicates already hinted in the code.
Per-reconcile List calls in reconcileContainerSpecs and processFinalizers (in-scope) Claude — Both methods list every mirror on every reconcile, and every child event enqueues the App singleton, so cost is O(children) per burst. The code's own TODO at engine_reconciler.go:191-195 calls out the long-term fix (per-resource reconcilers with deletion/spec-change predicates). Fine for a dev workstation, noticeable once a user runs hundreds of containers.
Controller grouping is encoded in three places (future) Codex — GetAPIGroup(), the embedded blank imports, and the external-binary blank imports all have to agree. I2 is the first divergence in that arrangement. A single per-group registration package imported by both embedded and external entrypoints would remove the whole class of drift.
rdd set wait only covers running=true/false (future) Claude — cmd/rdd/set.go:469-472 still has a TODO that changes like containerEngine.name trigger a VM restart without being waited on. The Reconciled condition design in MEMORY.md is the clean path forward.

Strengths

Dual SSA field manager for finalizer-only applies Claude Codex Gemini — Using engine-controller-finalizer as a separate field owner for finalizer reassertions is the right way to prevent a finalizer apply from releasing spec ownership, which would otherwise prune spec and fail the CRD's required-field validation. Documented clearly at sync_containers.go:138-148.
Daemon-clock anchoring for event replay Claude Codex Gemini — dockerEventsSince reads Info.SystemTime from the daemon and uses that as the Docker events Since filter, neutralising host/guest clock skew. The fallback to a biased host clock (now - 2m) pairs cleanly with an idempotent fullSync, so replaying a few extra events is safe.
retry.RetryOnConflict + re-Get across every App status writer Claude — Applied consistently in setEngineCondition, AppReconciler, removeMirrorResource, deleteAllOfType, and every finalizer processor. This is exactly the pattern needed now that two controllers (EngineReconciler and AppReconciler) both write to App.Status.Conditions and controller-runtime does not serialise across controllers.
watcherCtx scoped to manager lifetime, not Reconcile Claude Gemini — Dropping Reconcile's ctx and seeding the watcher with a manager.RunnableFunc-managed context prevents the Docker client from dying on every Reconcile return, and the shutdown hook guarantees clean teardown on manager exit.
Discovery ready annotation replaces ConfigMap-existence probes Gemini — Promoting readiness from "discovery ConfigMap exists" to an explicit rdd.rancherdesktop.io/ready=true annotation is the right direction; I3 is a sequencing issue with the current fire point, not with the design.
Immutable-spec webhook permits state transitions cleanly Claude — container_webhooks.go:38-43 copies the old spec, replaces only State, and then compares, which is the right way to allow one mutable field without enumerating all immutable ones.

Testing Assessment ¶

The new bats/tests/32-app-controllers/engine.bats suite covers the critical paths well: image pull / tag / untag / dangling / remove, container lifecycle and spec.state transitions, volumes with RFC-1123-invalid names, bidirectional deletion, rdd set wait semantics including --timeout=0, and the containerd no-op path. The unit test at sync_containers_test.go covers parseContainerName adequately, and discovery_test.go covers the InitDiscovery → MarkControlPlaneReady lifecycle.

Gaps ranked by risk:

Self-heal of containernamespace/moby — not only untested, the test at engine.bats:375-381 actively locks in the broken behavior (I1). Flipping refute_output to rdd ctl wait --for=create reverses the mask into a verification.
External containers-controller binary registration — no smoke test verifies that the binary registers the engine controller via discovery, so I2 shipped unnoticed. A smoke test that containers-controller exposes engine in discovery would catch future import/group mismatches.
Watcher reconnect on Docker daemon restart — the w.alive() detection, watcherDied log, and reconnect-triggers-fullSync path are only exercised if Docker actually disconnects during the test, which nothing currently does.
Webhook rejects non-state spec changes on existing containers — the suite exercises spec.state transitions but no test tries to change spec.image on an existing container and asserts the webhook rejects it.
processImageFinalizers on an image Docker refuses to delete — the suite covers in-use image finalizer retention, but not a permission-denied / structural error path where ImageRemove returns non-NotFound.
DOCKER_HOST resolution failure surfacing — S1: the current bare substitution hides socket resolution errors.

Documentation Assessment ¶

The design-doc updates (api_app.md, api_containers.md, cmd_app.md, discovery.md) are thorough and accurately describe the new behavior. api_containers.md's terminology section clarifying Container/Image/Volume (the rdd K8s resource) vs container/image/volume (the Docker engine object) is a notable addition that will pay off in future reviews. The ControllersFlagUsage constant in options.go keeps the --controllers help text in sync between embedded and standalone binaries.

Missing: an explicit statement of which standalone binary hosts engine. The code says apiGroup = "containers", but the external import graph disagrees (I2); picking one and documenting it closes the loop.

Acknowledged Limitations ¶

Per-reconcile List calls — documented inline at engine_reconciler.go:191-195; long-term fix is per-resource reconcilers.
rdd set does not wait on non-running spec changes — documented at cmd/rdd/set.go:469-472; design of a Reconciled condition is in MEMORY.md.
No containerd mirroring — documented in the code (engine_reconciler.go:147-156) and the design doc; returns NotApplicable on containerd.
Container.spec.state is level-triggered — documented in MEMORY.md as a known design limitation to be addressed with edge-triggered action annotations after containerd mirroring.
Windows Docker socket unsupported — documented in both the controller init() guard and skip_on_windows in engine.bats.
Image mirrors with empty status have no engine reference to forward a delete to — documented inline at sync_images.go:248-252 and in processImageFinalizers.

Agent Performance Retro ¶

Claude ¶

Claude carried the review. It surfaced I3 and I4 (the MarkControlPlaneReady sequencing and the registerDiscovery swallow), both of which required reading the full Start() flow and noticing that the ready annotation promise is stronger than the current code guarantees. It also raised the useful latent-comment pair (S4 fullSync-under-watcherCtx, S5 stale-watcher reference) that are low-impact but reward a future reader. A couple of findings were raised as important but are really polish — S4 and S5 were downgraded during consolidation because they are documented acknowledgments rather than behavior gaps. The coverage summary was disciplined: 40 files with per-file dispositions.

Codex ¶

Codex landed the two biggest findings. I1 (ContainerNamespace self-heal) required both tracing syncContainerNamespace back to its only caller and then noticing that the BATS test at 375-381 locks the broken behavior in — a nice cross-artifact catch. I2 (external-binary wiring) required comparing cmd/containers-controller/main.go against pkg/service/cmd/service.go and checking what GetAPIGroup() returns. Neither of the other two agents caught either issue. Codex's single suggestion (S1 BATS DOCKER_HOST) was the sharper framing — "bypasses run -0 per BATS conventions" — compared to Claude's more generic "moves coupling".

Gemini ¶

Gemini's single finding (S2 missing ObservedGeneration on the generic Container setCondition) is a valid forward-looking gap. Its design observations were crisp and it correctly flagged the discovery ready annotation as a clean direction, but it did not notice that the annotation fires too early (I3) or that registerDiscovery failure desyncs the wait (I4). Gemini marked every manager.go / engine_reconciler.go file "Reviewed, no issues" while those two files between them contain three of the four important findings — a meaningful coverage miss that mostly reflects skipping git blame (documented quota behavior) and a lighter trace of the Start() call path.

Summary ¶

	Claude Opus 4.6	Codex GPT 5.4	Gemini 3.1 Pro
Duration	4m 43s	7m 45s	4m 52s
Findings	2I 7S	2I 1S	1S
Tool calls	30 (Read 18, Bash 11, Agent 1)	80 (shell 77, stdin 3)	7 (grepsearch 3, runshellcommand 3, readfile 1)
Design observations	6	4	4
False positives	0	0	0
Unique insights	2	2	1
Files reviewed	40	40	40
Coverage misses	0	0	2
Totals	2I 7S	2I 1S	1S
Downgraded	2 (I→S)	0	0
Dropped	0	0	0

Reconciliation: Claude I3 startWatcher / fullSync comment: important → suggestion S4 (latent, comment-only). Claude I4 stale-watcher error-message clarity: important → suggestion S5 (self-healing on requeue, cosmetic).

Overall, Codex provided the most merge-blocking value this round (both important findings), Claude provided the highest density of correct findings and the best coverage discipline, and Gemini's contribution was limited to one forward-looking suggestion plus the useful design-observation framing.

Review Process Notes ¶

Repo context updates ¶

Discovery ready annotation sequencing — When reviewing controller-manager startup, trace the fire point of MarkControlPlaneReady against mgr.Start() and flag any path that sets the ready annotation before the manager has actually registered controllers and started webhooks. The annotation's contract is "every enabled controller has installed its CRDs and is reconciling" — clients use it to skip ad-hoc waits, so a premature fire creates a false-ready window in which admission webhooks may not be listening yet. Also flag swallow-then-mark-ready patterns where a registerDiscovery (or equivalent prerequisite) failure is logged-not-fatal but the ready annotation still fires over a partially populated ConfigMap.
Scaffolding resources must be part of steady-state reconciliation — When a reconciler splits work between a one-shot fullSync path and a steady-state path, verify that every mirror resource created by fullSync is either (a) watched and re-applied on steady-state reconcile, or (b) explicitly documented as "startup-only, never self-heals". Singleton scaffolding resources (namespaces, placeholder objects, ContainerNamespace-style markers) are the usual casualties because they're applied once and then never touched by steady-state code. Flag any watch list that omits a resource the controller applies anywhere.
Controller grouping agreement across entrypoints — In repos where controllers are registered via init() + base.RegisterController, verify that every controller's GetAPIGroup() return value matches the blank imports in both embedded and external binary entrypoints. A controller declared in group X must be blank-imported in the binary that runs external.RunControllers("X"). Mismatch is a silent drop — the feature disappears from external mode without any error. Grep for the controller package under both cmd/*/main.go and the embedded service package; a missing import in either is a regression.

Skill improvements ¶

Scan BATS suites for test assertions that lock in broken behavior — When reviewing a test that deletes a resource and then asserts its absence (refute_output, --for=delete), check whether the controller is supposed to recreate that resource. If the expected behavior is self-healing, a one-shot absence check is a false-positive test that hides the bug. Flag when the deletion test's assertion style is tighter than the controller's actual contract suggests.