Deep Review: 20260414-232915-pr-309

Date	2026-04-14 23:29
Repo	rancher-sandbox/rancher-desktop-daemon
Round	3 (of PR #309)
Author	@jandubois
PR	#309 — engine: mirror Docker container engine state into Kubernetes
Branch	`engine-v3`
Commits	`587bfe2` engine: mirror Docker container engine state into Kubernetes
Reviewers	Claude Opus 4.6, Codex GPT 5.4, Gemini 3.1 Pro
Verdict	Merge with fixes — one nil-deref and one Windows-wait regression worth fixing before merge; the cross-group registration and namespace self-healing gaps can ship as follow-ups if acknowledged.
Wall-clock time	`32 min 44 s`

Executive Summary ¶

PR #309 adds the engine controller that mirrors Docker containers, images, and volumes into containers.rancherdesktop.io resources, teaches rdd set to wait on post-write App conditions, and adds a discovery ready annotation so CRD consumers stop racing startup. The engine reconciler is well-structured: SSA field-owner separation, retry-on-conflict on the multi-writer App status, daemon-clock-anchored Docker event replay, and idempotent fullSync preserving mirror identities across reconnects.

Four important issues surfaced: (1) rdd set running=true now hangs for the full 5-minute timeout on Windows because no controller writes ContainerEngineReady there; (2) the rancher-desktop namespace and containernamespace/moby are created once in fullSync and never self-healed if a user deletes them mid-run; (3) the new controller is registered under API group app but watches Container/Image/Volume from the containers group, so selecting --controllers=app starts it without the CRDs it depends on; and (4) inspect.Config.Labels is dereferenced without a nil guard on the container and image apply paths, even though NetworkSettings is guarded a few lines down. Three lower-severity findings cover latent watcher-context decoupling, a BATS command substitution at file scope, and a webhook enum check that duplicates CRD validation.

Structure: 0 critical, 4 important, 3 suggestions.

Critical Issues ¶

None.

Important Issues ¶

I1. rdd set running=true hangs on Windows because no controller writes ContainerEngineReady Codex¶

cmd/rdd/set.go

		}
		targetGen = gen
	case err != nil:
		return fmt.Errorf("failed to get App: %w", err)
	default:
		gen, err := patchApp(ctx, c, &app, specMap, dryRun)
		if err != nil {
			return err
		}
		targetGen = gen
	}

	if dryRun || timeout == 0 {
		return nil

// cmd/rdd/set.go:474-479
if runningVal == "true" {
    logrus.Info("Waiting for container engine to be ready")
    return watchCondition(ctx, config, func(obj *unstructured.Unstructured) bool {
        status, gen, present := conditionInfo(obj, "ContainerEngineReady")
        return present && gen >= minGen && status == metav1.ConditionTrue
    })
}

rdd set now calls waitForDesiredState() after every successful write unless --timeout=0, and the running=true branch accepts only ContainerEngineReady=True.

On Windows, pkg/controllers/app/engine/controller.go:23-29 returns from init() before registering the engine controller at all. AppReconciler only mirrors LimaVM conditions and never synthesizes ContainerEngineReady (see pkg/controllers/app/app/controllers/app_controller.go:302-329), so on Windows the condition has no writer and rdd set running=true hangs for the full 5-minute default timeout before erroring out.

pkg/controllers/app/engine/controller.go

	containersv1alpha1 "github.com/rancher-sandbox/rancher-desktop-daemon/pkg/apis/containers/v1alpha1"
	"github.com/rancher-sandbox/rancher-desktop-daemon/pkg/controllers/app/engine/controllers"
	"github.com/rancher-sandbox/rancher-desktop-daemon/pkg/controllers/base"
)

func init() {
	// Windows lacks a Docker socket transport, so the reconciler would
	// hot-loop on connect errors. Skip registration until WSL2 support lands.
	if runtime.GOOS == "windows" {
		return
	}
	base.RegisterController(newController())
}

const (
	controllerName = "engine"
	// apiGroup is "containers" because the reconciler watches

pkg/controllers/app/app/controllers/app_controller.go

			return ctrl.Result{}, fmt.Errorf("failed to propagate running state to LimaVM: %w", err)
		}
		return ctrl.Result{}, nil
	}

	// Mirror LimaVM status conditions onto the App status. The engine
	// reconciler writes ContainerEngineReady on the same object, so
	// app's resourceVersion from the initial Get can be stale.
	// retry.RetryOnConflict + re-Get matches
	// EngineReconciler.setEngineCondition; without it, concurrent
	// writers 409-loop through controller-runtime requeues. Truncate
	// messages defensively against a future bypass of the LimaVM
	// controller's source-side truncation.
	if err := retry.RetryOnConflict(retry.DefaultRetry, func() error {
		latest := &v1alpha1.App{}
		if err := r.Get(ctx, client.ObjectKeyFromObject(&app), latest); err != nil {
			return err
		}
		changed := false
		for _, cond := range limaVM.Status.Conditions {
			changed = apimeta.SetStatusCondition(&latest.Status.Conditions, metav1.Condition{
				Type:               cond.Type,
				Status:             cond.Status,
				Reason:             cond.Reason,
				Message:            base.TruncateConditionMessage(cond.Message),
				ObservedGeneration: latest.Generation,
				LastTransitionTime: cond.LastTransitionTime,
			}) || changed
		}
		if !changed {
			return nil
		}
		return r.Status().Update(ctx, latest)
	}); err != nil {
		log.Error(err, "Unable to update App status")
		return ctrl.Result{}, err
	}

Windows support is still a TODO (the engine controller comment calls out that "WSL2 support" is pending), but a 5-minute UX regression on any Windows build path is worse than a clean error. Codex flagged this as Critical; we downgrade to Important because Windows is not yet a shipping target, but it still wants addressing before Windows work lands.

Fix: either (a) register a Windows stub engine reconciler that stamps a terminal ContainerEngineReady=True, Reason=NotApplicable once the App is Running, matching the non-moby path at engine_reconciler.go:136-147, or (b) have waitForDesiredState detect that the engine controller is not enabled (for example, by consulting the discovery ConfigMap's EnabledControllers) and skip the condition wait. Waiting on a condition with no writer is not safe.

I2. rancher-desktop namespace and containernamespace/moby are created once and never repaired Codex Claude¶

pkg/controllers/app/engine/controllers/engine_reconciler.go

// SetupWithManager sets up the controller with the Manager.
func (r *EngineReconciler) SetupWithManager(mgr ctrl.Manager) error {
	r.reconcileChan = make(chan event.GenericEvent, 1)

	// Map any Container/Image/Volume event to a reconcile of the App singleton.
	enqueueApp := handler.EnqueueRequestsFromMapFunc(
		func(_ context.Context, _ client.Object) []reconcile.Request {
			return []reconcile.Request{{NamespacedName: types.NamespacedName{Name: appName}}}
		},
	)

	return ctrl.NewControllerManagedBy(mgr).
		For(&appv1alpha1.App{}).
		Named("engine-reconciler").
		WatchesRawSource(source.Channel(r.reconcileChan, enqueueApp)).
		Watches(&containersv1alpha1.Container{}, enqueueApp).
		Watches(&containersv1alpha1.Image{}, enqueueApp).
		Watches(&containersv1alpha1.Volume{}, enqueueApp).

The engine reconciler watches App, Container, Image, and Volume only. corev1.Namespace and containersv1alpha1.ContainerNamespace are absent from the watch set.

ensureNamespace (sync_volumes.go:40-54) and syncContainerNamespace (sync_volumes.go:61-66) are only reached from fullSync (docker_watcher.go:426-451), which runs at watcher startup. Once the watcher is running, a kubectl delete namespace rancher-desktop or kubectl delete containernamespace moby produces no event that the reconciler cares about. The namespace stays gone — which cascade-deletes every mirror resource in it — and the ContainerNamespace stays gone until the next VM stop/start or Docker reconnect.

pkg/controllers/app/engine/controllers/sync_volumes.go

	return fmt.Sprintf("vol-%x", sum)
}

// ensureNamespace creates the Kubernetes namespace that holds the
// mirror resources if it doesn't exist.
func (w *dockerWatcher) ensureNamespace(ctx context.Context) error {
	var ns corev1.Namespace
	if err := w.k8s.Get(ctx, client.ObjectKey{Name: apiNamespace}, &ns); err != nil {
		if !apierrors.IsNotFound(err) {
			return err
		}
		ns = corev1.Namespace{
			ObjectMeta: metav1.ObjectMeta{Name: apiNamespace},
		}
		if err := w.k8s.Create(ctx, &ns); err != nil && !apierrors.IsAlreadyExists(err) {
			return fmt.Errorf("failed to create namespace %s: %w", apiNamespace, err)
		}
	}
	return nil
}

// syncContainerNamespace creates the "moby" ContainerNamespace mirror.
// This resource has no mirror finalizer: Docker has no corresponding
// engine object for the reverse delete, and cleanupMirrorResources
// sweeps it unconditionally on VM stop, so a finalizer with no
// handler would only trap user deletes in Terminating.
func (w *dockerWatcher) syncContainerNamespace(ctx context.Context) error {
	applyConfig := containersv1alpha1apply.ContainerNamespace(containerNamespace, apiNamespace)

	return w.k8s.Apply(ctx, applyConfig,
		client.ForceOwnership, client.FieldOwner(controllerName))
}

// syncAllVolumes lists all Docker volumes, creates or updates their
// Volume mirrors, and prunes stale ones.
func (w *dockerWatcher) syncAllVolumes(ctx context.Context) error {
	log := logf.FromContext(ctx).WithName("docker-watcher")

pkg/controllers/app/engine/controllers/docker_watcher.go

	return err
}

// fullSync lists all containers, images, and volumes from Docker and
// creates corresponding mirror resources. It also removes stale mirrors.
func (w *dockerWatcher) fullSync(ctx context.Context) error {
	log := logf.FromContext(ctx).WithName("docker-watcher")
	log.Info("Starting full sync")

	if err := w.ensureNamespace(ctx); err != nil {
		return fmt.Errorf("failed to ensure namespace: %w", err)
	}

	var errs []error

	if err := w.syncContainerNamespace(ctx); err != nil {
		errs = append(errs, fmt.Errorf("failed to sync container namespace: %w", err))
	}
	if err := w.syncAllContainers(ctx); err != nil {
		errs = append(errs, fmt.Errorf("failed to sync containers: %w", err))
	}
	if err := w.syncAllImages(ctx); err != nil {
		errs = append(errs, fmt.Errorf("failed to sync images: %w", err))
	}
	if err := w.syncAllVolumes(ctx); err != nil {
		errs = append(errs, fmt.Errorf("failed to sync volumes: %w", err))
	}

	log.Info("Full sync complete")
	return errors.Join(errs...)
}

Deleting the namespace also destroys every mirror resource without giving the mirror finalizer a chance to forward deletes to Docker, which is a correctness issue the design otherwise takes care to preserve.

Fix: add Watches(&corev1.Namespace{}, ...) filtered to rancher-desktop and Watches(&containersv1alpha1.ContainerNamespace{}, enqueueApp), then in the steady-state reconcile path (after startWatcher) unconditionally call w.ensureNamespace(ctx) and w.syncContainerNamespace(ctx). Both are idempotent SSA applies with Get-guards, so no-op cost is negligible.

I3. Engine controller registered under API group app but depends on containers CRDs Codex¶

pkg/controllers/app/engine/controller.go

		return
	}
	base.RegisterController(newController())
}

const (
	controllerName = "engine"
	// apiGroup is "containers" because the reconciler watches
	// Container, Image, and Volume from that group and needs
	// their CRDs at startup. Grouping engine with its dependencies
	// keeps --controllers selections from splitting the two apart.
	apiGroup = "containers"
)

type controller struct{}

func newController() base.Controller {
	return &controller{}
}

var _ base.Controller = &controller{}

func (c *controller) GetName() string {
	return controllerName
}

func (c *controller) GetAPIGroup() string {
	return apiGroup
}

func (c *controller) GetCRDData() string {
	return ""
}

shouldEnableController selects controllers by either name or API group. Starting with --controllers=app therefore enables the engine controller without enabling the containers API group. But the engine reconciler's SetupWithManager calls Watches(&containersv1alpha1.Container{}, ...), Image, and Volume — if those CRDs aren't installed, the manager start fails (the watch cache can't establish) or the reconciler endlessly re-attempts.

The controller also has no CRDs of its own (GetCRDData() returns ""), so the dependency is purely on another controller group's CRDs. That relationship is invisible to shouldEnableController and to the --controllers help text.

cmd/app-controller/main.go:12-20 imports the engine package into the standalone app-controller binary too, which means the standalone binary now also needs the containers API to be installed elsewhere, or it will fail at manager start.

cmd/app-controller/main.go

package main

import (
	"os"

	// Import app controller packages to trigger init() functions.
	_ "github.com/rancher-sandbox/rancher-desktop-daemon/pkg/controllers/app/app"
	_ "github.com/rancher-sandbox/rancher-desktop-daemon/pkg/controllers/app/demo"
	_ "github.com/rancher-sandbox/rancher-desktop-daemon/pkg/controllers/app/engine"
	"github.com/rancher-sandbox/rancher-desktop-daemon/pkg/external"
)

func main() {
	os.Exit(external.RunControllers("app"))
}

Fix options, in order of preference:

Move the engine controller into the containers group (rename apiGroup = "containers"). It mirrors Docker state into the containers API, so this is where it conceptually belongs. This also makes the dependency implicit.
Teach shouldEnableController to auto-enable cross-group dependencies, or refuse startup with a clear error if engine is selected without containers.
At minimum, document the dependency in the --controllers help text.

I4. Nil-pointer dereference on inspect.Config.Labels for containers and images Claude¶

pkg/controllers/app/engine/controllers/sync_containers.go

	// the same way.
	var labels map[string]string
	if inspect.Config != nil {
		labels = inspect.Config.Labels
	}

	// Build status.
	statusApply := containersv1alpha1apply.ContainerStatus().
		WithName(name).
		WithNamespace(namespace).
		WithPath(inspect.Path).

// sync_images.go:157-163
statusApply := containersv1alpha1apply.ImageStatus().
    WithID(inspect.ID).
    WithRepoDigests(digests...).
    WithArchitecture(inspect.Architecture).
    WithOS(inspect.Os).
    WithSize(inspect.Size).
    WithLabels(inspect.Config.Labels)

mobycontainer.InspectResponse.Config is *container.Config and mobyimage.InspectResponse.Config is *dockerspec.DockerOCIImageConfig (verified in moby api@v1.54.1). Both are pointers that can be nil — the container path also explicitly guards inspect.NetworkSettings a few lines down (sync_containers.go:182), showing the authors already know these fields are nilable.

pkg/controllers/app/engine/controllers/sync_containers.go

	}
	if t, err := time.Parse(time.RFC3339Nano, inspect.State.StartedAt); err == nil && !t.IsZero() {
		statusApply.WithStartedAt(metav1.NewTime(t))
	}
	if t, err := time.Parse(time.RFC3339Nano, inspect.State.FinishedAt); err == nil && !t.IsZero() {
		statusApply.WithFinishedAt(metav1.NewTime(t))
	}

	// Sort port bindings by name before applying. Go map iteration is
	// randomized, and even with +listType=map the entries land in the
	// stored object in the order we send them — a stable order avoids

Nil Config is unlikely for a freshly inspected container but possible for legacy/multi-platform images or when a Docker-side bug corrupts an image record. A panic in the Docker watcher goroutine is partially caught by the recover at docker_watcher.go:144-149, but that recover only logs and returns — the watcher goroutine exits, the reconciler detects a dead watcher and reconnects, and the panic rechecks the same input on the next sync, reproducing the crash. That's an infinite reconnect loop on a single bad image, not a clean failure.

pkg/controllers/app/engine/controllers/docker_watcher.go

	// the reconciler could wake, see alive()==true on the about-to-exit
	// goroutine, and skip the restart.
	defer w.enqueueReconcile()
	defer close(w.done)
	// Keep a bad event shape from crashing the whole app-controller.
	defer func() {
		if r := recover(); r != nil {
			log.Error(nil, "panic in Docker watcher goroutine",
				"recovered", r, "stack", string(debug.Stack()))
		}
	}()

	eventFilter := dockerclient.Filters{}.
		Add("type", string(events.ContainerEventType)).
		Add("type", string(events.ImageEventType)).
		Add("type", string(events.VolumeEventType))

Fix: guard both accesses. The pattern used elsewhere in this file is a local nil-check before building the apply config:

var labels map[string]string
if inspect.Config != nil {
    labels = inspect.Config.Labels
}
// ...WithLabels(labels)

Apply the same pattern in imageStatusFromInspect.

Suggestions ¶

S1. Webhook state-enum check duplicates the CRD schema Claude Gemini¶

pkg/controllers/containers/container/container_webhooks.go


// ValidateUpdate implements [ctrlwebhookadmission.Validator].
// Only spec.state changes are allowed; all other spec fields are immutable.
// The CRD schema already constrains spec.state to the valid enum
// ({created, running, unknown}), so this webhook only enforces that
// nothing else in the spec changes.
func (c *immutableValidator) ValidateUpdate(_ context.Context, oldContainer, newContainer *v1alpha1.Container) (warnings ctrlwebhookadmission.Warnings, err error) {
	oldCopy := oldContainer.Spec
	oldCopy.State = newContainer.Spec.State
	if !equality.Semantic.DeepEqual(oldCopy, newContainer.Spec) {
		return nil, fmt.Errorf("only spec.state may be changed: old: %v, new: %v", oldContainer.Spec, newContainer.Spec)
	}
	return nil, nil
}

spec.state is declared with +kubebuilder:validation:Enum=created;running;unknown (container_types.go:69), and the CRD schema runs before validating admission webhooks. Any out-of-enum value is rejected with a 400 before this webhook executes, so the switch is unreachable.

pkg/apis/containers/v1alpha1/container_types.go

	// "created" is satisfied by any non-running state (created,
	// exited, dead).
	//
	// +required
	// +kubebuilder:default:=unknown
	// +kubebuilder:validation:Enum=created;running;unknown
	State ContainerStatusValue `json:"state"`
}

// ContainerStatus defines the observed state of the container.
type ContainerStatus struct {

Gemini rated this Important; the code is dead but harmless, so Suggestion fits better.

Fix: drop the switch. The spec-normalisation DeepEqual check is the only work the webhook needs to do for state transitions.

S2. engine.bats uses $(rdd ...) substitution at file scope Claude Gemini¶

bats/tests/32-app-controllers/engine.bats

# Engine controller tests — verify that the engine controller mirrors Docker
# containers, images, and volumes into Container, Image, and Volume
# resources, and that deletions and spec.state changes are forwarded
# to Docker.

NAMESPACE="rancher-desktop"
DOCKER_HOST="unix://$(rdd svc paths docker_socket)"
export DOCKER_HOST

local_setup_file() {
    # The Docker socket access pattern used by these tests is not yet wired
    # up for Windows/WSL2.
    skip_on_windows

DOCKER_HOST is assigned via command substitution at file scope, which runs during the BATS parse phase — before skip_on_windows in local_setup_file, and before run -0 can surface a clean error. Per the repo's BATS style rules, command substitution outside run hides failures in the empty string. On Windows (where the test is meant to skip) this still executes rdd svc paths docker_socket, which is wasted work at best and a misleading failure at worst.

Gemini rated this Important; Claude rated it a Suggestion, and style-violation severity is better classed as Suggestion.

Fix: move into local_setup_file, after skip_on_windows:

local_setup_file() {
    skip_on_windows
    run -0 rdd svc paths docker_socket
    DOCKER_HOST="unix://${output}"
    export DOCKER_HOST
    rdd svc delete
    rdd set running=true
}

S3. Docker watcher goroutine bound to the Reconcile context Gemini¶

pkg/controllers/app/engine/controllers/engine_reconciler.go

			}
		}
		return ctrl.Result{}, r.setEngineCondition(ctx, &app, terminalStatus, terminalReason, terminalMessage)
	}

	if !watcherRunning {
		log.Info("App is running, starting Docker watcher")
		if err := r.startWatcher(ctx); err != nil {
			log.Error(err, "Failed to start Docker watcher")
			if condErr := r.setEngineCondition(ctx, &app, metav1.ConditionFalse, "ConnectFailed", err.Error()); condErr != nil {
				log.Error(condErr, "Failed to update ContainerEngineReady to ConnectFailed")
			}
			return ctrl.Result{}, err
		}
	}

	// Re-assert Connected on every steady-state reconcile so its
	// observedGeneration tracks the current App generation. When the
	// spec is patched without cycling the watcher (e.g. `rdd set
	// running=true kubernetes.enabled=true`), this is the only place

r.startWatcher(ctx) is called with the ctx passed into Reconcile, and newDockerWatcher derives watchCtx from it. Currently controller-runtime does not per-request cancel a Reconcile ctx (it's derived from the manager ctx), so the watcher goroutine survives a Reconcile return. But it's a latent coupling: any future ReconciliationTimeout, per-request deadline, or middleware that adds a cancel would instantly kill the watcher when Reconcile returns, and the reconciler would immediately restart it, cycling forever.

Fix: decouple the watcher from the request ctx with context.WithoutCancel:

w, err := newDockerWatcher(context.WithoutCancel(ctx), r.Client, r.reconcileChan)

This preserves the logger and other context values but isolates cancellation.

Design Observations ¶

Strengths ¶

SSA field-owner separation (controllerName vs finalizerFieldOwner) (in-scope) Claude Gemini — The insight that a finalizer-only SSA apply through the main field owner would release ownership of spec and fail CRD required-field validation is non-obvious and correctly addressed by running finalizer-only applies through a distinct field manager. Worth documenting as a pattern for other controllers.
Daemon-clock anchoring for Docker events (in-scope) Claude — dockerEventsSince uses Info.SystemTime instead of time.Now() for the Since filter, avoiding host/guest clock skew that would silently drop events inside the skew window. The fallback to a 2-minute-biased host clock is a sensible safety net for Info failures.
Idempotent fullSync preserves mirror identities (in-scope) Claude — Container IDs, image hashes, and volume hashes are stable across watcher reconnects, so transient Docker disconnects cause no resource churn for downstream clients. Combined with the alive() detection + reconnect path in the reconciler, this handles daemon restarts cleanly.
observedGeneration gating in rdd set (in-scope) Claude Codex Gemini — Filtering condition snapshots by gen >= minGen rejects stale ContainerEngineReady=True from a previous backend that would otherwise prematurely satisfy the wait. Both EngineReconciler and AppReconciler stamp ObservedGeneration consistently on updates, which is what makes the filter reliable.
Stable sort before SSA apply on atomic lists (in-scope) Gemini — ContainerPort bindings, port names, and RepoDigests are sorted before each apply, avoiding spurious resourceVersion churn caused by Go's randomized map iteration order combined with atomic list semantics.
retry.RetryOnConflict on cross-controller App status writes (in-scope) Claude — Both AppReconciler.Reconcile and EngineReconciler.setEngineCondition wrap their App status updates in retry + re-Get, which is the correct response to two controllers writing different conditions on the same object outside controller-runtime's per-controller serialisation.

Concerns ¶

Per-reconcile List cost for container specs and finalizers (future) Claude — reconcileContainerSpecs and processFinalizers each List all Containers/Images/Volumes per reconcile. Every watched child event triggers a reconcile, so total cost is O(N) per child event. The engine_reconciler.go:181-185 comment already flags this and names the long-term fix (per-resource reconcilers with watch predicates on deletionTimestamp set and spec.state changed). Worth prioritising once mirror counts exceed ~100 per type, but fine as-is for dev workstations.

Testing Assessment ¶

Test coverage is strong for the new functionality. The new BATS suite bats/tests/32-app-controllers/engine.bats covers image pull/untag/dangling flows, container lifecycle (create/stop/rm), volumes with RFC-1123-invalid names, bidirectional deletion (K8s→Docker and Docker→K8s), spec.state transitions (running↔created, pause handling), the non-moby NotApplicable path, and rdd set wait behaviour (already-stopped, --timeout=0). Unit tests cover parseContainerName and the InitDiscovery/MarkControlPlaneReady lifecycle.

Codex ran go test ./pkg/controllers/app/engine/controllers ./pkg/service/controllers ./pkg/controllers/containers/container ./pkg/controllers/app/app/controllers during its review; all four packages passed.

Untested scenarios ranked by risk:

rdd set running=true on Windows (the regression in I1). BATS skips Windows, so no CI gate catches it.
--controllers=app startup without the containers group (I3). No selective-start test exercises this.
Deleting namespace/rancher-desktop or containernamespace/moby while the app is running (I2). No self-healing test.
Docker daemon disconnect during the event stream (watcher death → reconnect cycle). Only implicitly covered through VM stop/start.
A corrupt/legacy image with nil Config (I4). Can't easily be simulated against a real Docker engine; a unit-level test against a hand-built inspect response would be cheap and catch the regression.

Documentation Assessment ¶

Design docs are updated comprehensively: api_app.md (new ContainerEngineReady condition), api_containers.md (engine mirroring, terminology, spec.state, finalizer lifecycle), cmd_app.md (--timeout, wait semantics), and discovery.md (ready annotation). The terminology section clarifying capitalized K8s resource names vs lowercase Docker engine objects is a useful addition.

Two gaps surfaced by the review:

docs/design/cmd_app.md and docs/design/api_app.md describe running=true as waiting on ContainerEngineReady without noting the Windows no-writer caveat (I1).
The --controllers help text in pkg/service/controllers/options.go:32 does not describe the app→containers dependency edge (I3).

Acknowledged Limitations ¶

engine_reconciler.go:181-185 — Per-reconcile List calls for specs and finalizers are O(N) per child event. Long-term fix is per-resource reconcilers with watch predicates.
cmd/rdd/set.go:460-463 — rdd set does not wait on property changes that trigger a VM restart without changing running (for example, containerEngine.name alone). Distinct from I1.
docker_watcher.go:179-190 — Transient handleEvent errors for image pull and volume create are logged and dropped; the mirror stays missing until the next full resync (e.g., VM cycle). A periodic fullSync tick is explicitly deferred.
sync_images.go:239-243 — removeImagesByID matches on status.id; a narrow orphan window exists if the metadata Apply succeeded but the status Apply failed transiently. Self-heals on the next watcher restart via syncAllImages.
engine/controller.go:23-29 — Engine controller registration skipped on Windows until WSL2 socket access is wired up. Interacts with I1.
engine_reconciler.go:136-147 — ContainerEngineReady name applies to moby only; will be renamed when containerd mirroring lands.

Agent Performance Retro ¶

Claude ¶

Claude delivered the most thorough review: it caught the nil-pointer on inspect.Config.Labels that neither other agent found, verified it against the moby types at the right module version, and produced the most complete Coverage Summary with per-file "Trivial" vs "Reviewed, no issues" classifications. Its SSA field-owner strength call and the observedGeneration gating call showed the deepest engagement with the finalizer/multi-writer design. Two of its five findings overlap with Gemini at the same severity, which is a good signal of shared ground truth. It missed the Windows wait regression and the cross-group dependency — the two biggest structural issues Codex caught. Calibration was appropriate (one Important, four Suggestions, nothing inflated).

Codex ¶

Codex delivered the two highest-impact structural findings — the Windows wait regression and the app→containers cross-group dependency — which nobody else saw. It was the only agent that ran go test on the affected packages to verify the unit tests still pass, and it correctly linked the namespace self-healing gap to the watch set. Its retro was the sharpest at connecting findings to testing gaps. The severity calibration on C1 (Windows hang) is defensible but we downgraded to Important because Windows is not currently a shipping target; worth noting that Codex's threshold for "Critical" is UX regression on a tier-1 platform, which is reasonable. It missed the nil-pointer on inspect.Config, which was the most concrete local defect.

Gemini ¶

Gemini hit the Google backend capacity limits (429 errors) partway through the run and retried with backoff, but still produced a structured review on time. It independently landed on two findings Claude also caught (webhook dead code, BATS substitution) and added the latent watcher-context decoupling observation. The latter is a real gap Claude and Codex missed, but Gemini inflated all three of its findings to Important despite two being style/cosmetics and one being latent — calibration was too hot. Per the noted behaviour, Gemini skipped git blame, so it could not distinguish regressions from pre-existing issues; Claude and Codex covered that dimension.

Summary ¶

	Claude Opus 4.6	Codex GPT 5.4	Gemini 3.1 Pro
Duration	8m 02s	7m 57s	13m 02s
Findings	1I 3S	3I	3S
Tool calls	45 (Read 23, Grep 11, Bash 10)	86 (shell 85, stdin 1)	21 (runshellcommand 14, grepsearch 4, readfile 3)
Design observations	6	2	3
False positives	0	0	0
Unique insights	1	3	1
Files reviewed	38	38	38
Coverage misses	0	0	0
Totals	1I 3S	3I	3S
Downgraded	0	1 (I→S)	3 (I→S)
Dropped	0	0	0

Overall, Claude and Codex provided complementary value: Claude caught the local nil-deref and gave the cleanest coverage picture; Codex caught the structural regressions (Windows wait, cross-group dependency, namespace self-healing). Gemini contributed one unique latent finding and a second opinion on two Claude suggestions, at the cost of inflated severities. For a three-agent run on a 162 KB diff this was a healthy spread — the three severity votes per finding let us calibrate confidently.

Reconciliation — Codex P1 rdd set running=true Windows hang: critical → important I1 (Windows is not yet a shipping target, so it's a UX regression rather than a correctness ship-stopper). Gemini P1 watcher-context coupling: important → suggestion S3 (latent, not currently triggered). Gemini P1 BATS substitution: important → suggestion S2 (style violation, no runtime impact). Gemini P1 webhook dead code: important → suggestion S1 (unreachable but harmless). No findings were dropped as false positives.

Review Process Notes ¶

Skill improvements ¶

Add a dimension for cross-group / cross-controller dependency checking. When a controller registers under API group X but its SetupWithManager watches types from group Y, the selectable startup modes have an invisible edge: --controllers=X starts the controller without the CRDs it depends on. Recognise the pattern in code by reading each controller's apiGroup (or equivalent group identifier) and checking that every type the reconciler watches is declared in the same group, or that the dependency is made explicit in controller selection logic. Flag when they diverge. Why: controllers within multi-group managers commonly fail at startup only when a user picks a non-default controller subset, and default-path tests never exercise that combination.

Repo context updates ¶

(none — deep-review-context.md was accurate for this review)