Deep Review: 20260415-015141-pr-309

Date2026-04-15 01:51
Reporancher-sandbox/rancher-desktop-daemon
Round4 (of PR #309)
Author@jandubois
PR#309 — engine: mirror Docker container engine state into Kubernetes
Branchengine-v3
Commitsa5235c8 engine: mirror Docker container engine state into Kubernetes
ReviewersClaude Opus 4.6, Codex GPT 5.4, Gemini 3.1 Pro
VerdictMerge with fixes — three important regressions introduced by prior-round fixes: a discovery ordering race that re-opens the rdd set running=true gap on cold start, engine cleanup that deletes unowned mirror resources, and the standalone app-controller binary importing engine under the wrong manager group.
Wall-clock time16 min 6 s


Executive Summary

Round 4 of PR #309 reviews the fixes applied after round 3. Four of the five prior findings were addressed cleanly (nil-pointer guards on inspect.Config, engine controller moved to the containers API group, webhook dead code removed, watcher context decoupled via context.WithoutCancel). One was skipped by design (namespace self-healing) and one remains a standing style disagreement (BATS $(...) substitution at file scope).

This round surfaces three new Important regressions, each a direct consequence of a round-3 fix interacting with the rest of the control plane. (1) rdd.rancherdesktop.io/ready=true now fires before registerDiscovery populates the enabled-controller list, so on cold start isEngineControllerEnabled sees an empty list and rdd set running=true silently falls through to the weaker Running=True wait — exactly reintroducing the race the PR aims to close. (2) cleanupMirrorResources and the full-sync prune paths operate on every Container/Image/Volume in rancher-desktop, not only on engine-owned mirrors, which forced the containers-mock.bats suite to spell containers,-engine to keep the two controllers from fighting over shared kinds. (3) cmd/app-controller/main.go now imports engine, but pkg/external/main.go does not filter by GetAPIGroup(), so a standalone app-controller run can register a containers-group reconciler under the app shared manager without the containers CRDs it needs.

Four lower-severity items cover a latent watcher-goroutine leak on manager shutdown (the context.WithoutCancel fix from round 3 has no shutdown hook), missing teardown cleanup in engine.bats, a stylistic errs := []error{applyErr} seed, and stale --controllers help text that now omits the containers group. Gemini filed a Critical "invalid @test syntax" finding against engine.bats that is a hallucination — the file uses @test correctly — and an Important "API-group mismatch" claim whose cause-and-effect trace is backwards from what pkg/external/main.go actually does.

Structure: 0 critical, 3 important, 4 suggestions.


Critical Issues

None.


Important Issues

I1. rdd set running=true cold-start race: ReadyAnnotation fires before the enabled-controller list is populated Codex Claude
	log.Info("Installing CRDs for all controllers in parallel", "controllers", len(scm.registrations))
	if err := scm.installControllerCRDs(ctx); err != nil {
		return fmt.Errorf("failed to install controller CRDs: %w", err)
	}

	// Configure controller-runtime to use klog
	ctrllog.SetLogger(log.WithName(fmt.Sprintf("controller-runtime-%s", scm.name)))

	// Create and register scheme with required types
	managerScheme := runtime.NewScheme()
	if err := scheme.AddToScheme(managerScheme); err != nil {
		return fmt.Errorf("failed to add core scheme: %w", err)
	}
	// Add CRD types to support dynamic resource discovery
	if err := apiextensionsv1.AddToScheme(managerScheme); err != nil {
		return fmt.Errorf("failed to add apiextensions scheme: %w", err)
	}

SharedControllerManager.Start installs CRDs, then sets ReadyAnnotation, and only later — at manager.go:202 — calls registerDiscovery, which is what writes the EnabledControllers list into configMap.Data.


	// Mark the control plane ready after registerDiscovery so clients
	// waiting on the ready annotation see both CRDs installed and
	// controller registration written, not just CRDs.
	if err := MarkControlPlaneReady(ctx, scm.discovery.client); err != nil {
		return fmt.Errorf("failed to mark control plane as ready: %w", err)
	}

	// Ensure cleanup on shutdown with a timeout to avoid blocking if apiserver is gone
	defer func() {
		cleanupCtx, cleanupCancel := context.WithTimeout(context.Background(), 5*time.Second)
// set.go:490-505
engineEnabled, err := isEngineControllerEnabled(ctx, config)
if err != nil {
    logrus.WithError(err).Debug("Falling back to ContainerEngineReady wait after discovery lookup failed")
    engineEnabled = true
}
if engineEnabled {
    logrus.Info("Waiting for container engine to be ready")
    return watchCondition(ctx, config, func(obj *unstructured.Unstructured) bool {
        status, gen, present := conditionInfo(obj, "ContainerEngineReady")
        return present && gen >= minGen && status == metav1.ConditionTrue
    })
}
logrus.Info("Waiting for the VM to start (engine controller not enabled)")
return watchCondition(ctx, config, func(obj *unstructured.Unstructured) bool {
    status, gen, present := conditionInfo(obj, "Running")
    return present && gen >= minGen && status == metav1.ConditionTrue
})

rdd set running=true calls waitForDesiredStateisEngineControllerEnabled, which reads configMap.Data via GetEnabledControllers (discovery.go:270-289) and returns (false, nil) for a ConfigMap with no data entries. In the window between MarkControlPlaneReady and registerDiscovery, the predicate is: ready annotation set, but EnabledControllers empty. rdd svc start --wait and cmd/rdd/service.go already wait on the ready annotation, so rdd set proceeds into the wait, sees engineEnabled=false, and falls through to the weaker Running=True wait.

	return &info, nil
}

// GetEnabledControllers returns the list of all enabled controllers, across all
// controller managers.  Note that the returned controllers may not be running.
func (d *ControllerManagerDiscovery) GetEnabledControllers(ctx context.Context) ([]string, error) {
	configMap, err := d.client.CoreV1().ConfigMaps(d.namespace).Get(ctx, ControllerManagerConfigMapName, metav1.GetOptions{})
	if errors.IsNotFound(err) {
		return nil, nil // No controller manager running
	}
	if err != nil {
		return nil, fmt.Errorf("failed to discover controller manager: %w", err)
	}

	enabledControllers := make([]string, 0) // non-nil: distinguishes "0 controllers" from "ConfigMap not found"
	for _, serializedInfo := range configMap.Data {
		var info ControllerManagerInfo
		if err := json.Unmarshal([]byte(serializedInfo), &info); err != nil {
			return nil, fmt.Errorf("failed to parse controller manager info: %w", err)
		}

		enabledControllers = append(enabledControllers, info.EnabledControllers...)
	}

	return enabledControllers, nil
}

// IsControllerRunning checks if a specific controller is running in any of the
// shared controller managers.
func (d *ControllerManagerDiscovery) IsControllerRunning(ctx context.Context, controllerName string) (bool, *ControllerManagerInfo, error) {

Running=True is satisfied by LimaVM reaching Running — well before the engine controller has connected to Docker, run fullSync, and stamped ContainerEngineReady=True. That is exactly the race this PR is designed to close. engine.bats never catches it because its tests either start the service with rdd set running=true against an already-ready control plane or run after startup has fully settled.

Codex originally traced this as Important; Claude independently flagged the same ordering (as S1 "isEngineControllerEnabled queries discovery before controller managers register") and proposed defaulting to ContainerEngineReady when the controller list is empty-during-startup. We keep the Important classification because the failure mode reintroduces a ship-blocking behaviour from before the PR.

Fix options, in order of preference:

  1. Move registerDiscovery before MarkControlPlaneReady in SharedControllerManager.Start. The enabled-controller list is then visible to every client that waits for the ready annotation. This is the minimal change and preserves the "ready = clients may use the control plane" contract.
  2. If keeping readiness independent of controller-manager health is load-bearing, precompute the enabled-controller list into the discovery ConfigMap (or a sibling key) before MarkControlPlaneReady, and have isEngineControllerEnabled read that precomputed record.
  3. As a defensive change on top, let isEngineControllerEnabled distinguish "empty list" from "registered with zero engine" and default to the stricter ContainerEngineReady wait when the list is empty. This alone is not sufficient — it papers over the ordering — but it is a belt-and-braces safeguard.
I2. Engine cleanup and full-sync prune delete every Container/Image/Volume in rancher-desktop, not only engine-owned mirrors Codex
	}
	if err := r.deleteAllOfType(ctx, &containersv1alpha1.ImageList{}); err != nil {
		errs = append(errs, fmt.Errorf("failed to delete Images: %w", err))
	}
	if err := r.deleteAllOfType(ctx, &containersv1alpha1.ContainerNamespaceList{}); err != nil {
		errs = append(errs, fmt.Errorf("failed to delete ContainerNamespaces: %w", err))
	}
	return errors.Join(errs...)
}

// deleteAllOfType lists and removes every resource of the given kind
// in apiNamespace. Engine is authoritative for Container, Image, and
// Volume in `rancher-desktop`: this loop deletes every object, not
// just engine-owned mirrors. Coexistence with another writer to these
// kinds requires an engine-owned label and matching filters here and
// in the sync_*.go full-sync prune paths. Finalizer removal retries
// on conflict; per-item errors are collected so one stuck object does
// not block the rest.
func (r *EngineReconciler) deleteAllOfType(ctx context.Context, list client.ObjectList) error {
	if err := r.List(ctx, list, client.InNamespace(apiNamespace)); err != nil {
		return err
	}

	items, err := meta.ExtractList(list)
	if err != nil {
		return err
	}

	var errs []error
	for _, item := range items {
		obj := item.(client.Object)
		key := client.ObjectKeyFromObject(obj)
		// meta.ExtractList leaves each item's TypeMeta empty (GVK is
		// only on the list), so format the Go type with %T instead of
		// obj.GetObjectKind().
		retryErr := retry.RetryOnConflict(retry.DefaultRetry, func() error {
			latest := obj.DeepCopyObject().(client.Object)
			if err := r.Get(ctx, key, latest); err != nil {
				if apierrors.IsNotFound(err) {
					return nil
				}
				return err
			}
			if !controllerutil.RemoveFinalizer(latest, mirrorFinalizer) {
				return nil
			}
			return r.Update(ctx, latest)
		})
		if retryErr != nil {
			errs = append(errs, fmt.Errorf("failed to remove finalizer from %T %s: %w",
				obj, obj.GetName(), retryErr))
			continue
		}
		// Delete uses the stale list item, not latest from the retry

When cleanupMirrorResources runs (App stopped, backend switched, controller teardown), deleteAllOfType lists every object of a given kind in the rancher-desktop namespace. For each object, it enters a retry closure that removes the engine mirrorFinalizer if present — but the closure exits with nil on "finalizer not present" (line 319). The outer loop then falls through to r.Delete(ctx, obj) unconditionally at line 331. An object created by any other controller in rancher-desktop, lacking the engine finalizer, is deleted anyway.

The full-sync prune paths have the same shape:

// sync_containers.go:62-75
var containerMirrors containersv1alpha1.ContainerList
if err := w.k8s.List(ctx, &containerMirrors, client.InNamespace(apiNamespace)); err != nil {
    return fmt.Errorf("failed to list Containers: %w", err)
}
for i := range containerMirrors.Items {
    c := &containerMirrors.Items[i]
    if !dockerIDs[c.Name] {
        log.V(1).Info("Removing stale Container", "id", c.Name)
        if err := w.removeMirrorResource(ctx, c, c.Name); err != nil {
            errs = append(errs, err)
        }
    }
}

Any Container whose metadata.name does not correspond to a live Docker container ID is pruned, regardless of whether it was created by the engine controller. sync_images.go:220-239 and sync_volumes.go apply the same logic to their kinds. The fallout is already visible in this PR: bats/tests/34-containers-controllers/containers-mock.bats had to switch to --controllers=containers,-engine so that the mock container controller keeps exclusive ownership of the mirror kinds while its tests run. That is a test-only workaround for an architectural problem: two controllers in the same group both write authoritative state into the same namespace and the same kinds, with no ownership marker distinguishing their objects.


	freshNames, applyErr := w.applyImageFromInspect(ctx, result.InspectResponse)
	keep := make(map[string]bool, len(freshNames))
	for _, n := range freshNames {
		keep[n] = true
	}

	var images containersv1alpha1.ImageList
	if err := w.k8s.List(ctx, &images, client.InNamespace(apiNamespace)); err != nil {
		return errors.Join(applyErr, err)
	}
	var errs []error
	if applyErr != nil {
		errs = append(errs, applyErr)
	}
	for i := range images.Items {
		img := &images.Items[i]
		if img.Status.ID != result.InspectResponse.ID {
			continue
		}
		if keep[img.Name] {
			continue
		}
		if err := w.removeMirrorResource(ctx, img, img.Name); err != nil {
			errs = append(errs, err)
		}
	}
	return errors.Join(errs...)
}

Fix: stamp engine-owned objects with a distinguishing label or annotation (e.g. engine.rancherdesktop.io/owned-by=engine-controller) in addition to the finalizer, and filter both deleteAllOfType and every full-sync prune by that marker. Unmarked objects must be skipped entirely so other controllers — the mock used by containers-mock.bats today, any future containerd mirror, or a human-created test fixture — can coexist in the same namespace without losing state on a VM stop or backend switch. Deleting by marker also removes the need for the -engine carve-out in containers-mock.bats.

Two earlier rounds discussed the related "namespace self-healing" question and decided not to re-create rancher-desktop / containernamespace/moby after a user deletes them mid-run. That decision is distinct from this one: self-healing is about re-creating your own objects, while this finding is about not destroying someone else's.

I3. cmd/app-controller imports the engine package, but external.RunControllers does not filter by API group — the standalone binary can register engine under the wrong shared manager Codex

import (
	"os"

	// Import app controller packages to trigger init() functions.
	_ "github.com/rancher-sandbox/rancher-desktop-daemon/pkg/controllers/app/app"
	_ "github.com/rancher-sandbox/rancher-desktop-daemon/pkg/controllers/app/demo"
	"github.com/rancher-sandbox/rancher-desktop-daemon/pkg/external"
)

func main() {
	os.Exit(external.RunControllers("app"))
}

cmd/app-controller/main.go now blank-imports engine to trigger its init(). The intent of round 3 was the opposite: engine declares itself in the containers API group (pkg/controllers/app/engine/controller.go:32-38) precisely so --controllers=app selections cannot pull it in without its CRDs.

		return
	}
	base.RegisterController(newController())
}

const (
	controllerName = "engine"
	// apiGroup is "containers" because the reconciler watches
	// Container, Image, and Volume from that group and needs
	// their CRDs at startup. Grouping engine with its dependencies
	// keeps --controllers selections from splitting the two apart.
	apiGroup = "containers"
)

type controller struct{}

func newController() base.Controller {
// pkg/external/main.go:62-72
for _, controller := range base.GetAllControllers() {
    shouldStart, err := shouldStartController(ctx, config, controller.GetName(), setupLog)
    // ...
    } else if shouldStart {
        controllersToStart = append(controllersToStart, controller)
    }
}

RunControllers iterates the global registry and filters only by shouldStartController, which checks whether the controller is already running in another manager. It never compares controller.GetAPIGroup() to the apiGroupName passed in at line 31. If the embedded control plane is running with engine disabled (--controllers=-engine), or if engine has not registered itself in discovery yet, shouldStartController returns true, and cmd/app-controller adds engine to its controllersToStart list. Engine is then registered into the shared manager named "app" at line 99 — not "containers" — which is exactly the scenario round 3 tried to prevent. Once in, engine's SetupWithManager watches Container/Image/Volume from the containers group; if those CRDs are not installed (they are not installed by app-controller itself — demo and app have no containers dependency), the watch caches fail to establish and the manager startup errors out.

Gemini filed a related finding but reversed the effect: it claimed external.RunControllers("app") filters by API group and would therefore silently omit engine. That would be a safer outcome than what the code actually does, but RunControllers does not filter. The engine controller is imported and registered under the wrong shared manager.

Fix options, in order of preference:

  1. Drop the _ "...controllers/app/engine" import from cmd/app-controller/main.go. The blank import contributes nothing the standalone binary needs now that engine belongs to a different group, and the other two app-group controllers continue to work. This is a one-line change.
  2. In pkg/external/main.go, add an if controller.GetAPIGroup() != apiGroupName { continue } guard to the controllers-to-start loop. This turns the filtering into a systemic guardrail: a future mistaken import in any standalone binary is prevented from cross-registering into the wrong manager. Pair with (1) for belt-and-braces.
  3. If the intent is to eventually run engine from a containers-specific standalone binary (cmd/containers-controller), add that binary and move the engine import there instead.

Suggestions

S1. context.WithoutCancel leaks the watcher goroutine on manager shutdown Claude
	//
	// Run both passes unconditionally and join their errors. Early
	// return on specsErr would stall every mirror's finalizer behind
	// a single stuck spec.state actuation: the next reconcile would
	// hit the same broken container first and skip finalizers again.
	var specsErr, finErr error
	if err := r.reconcileContainerSpecs(ctx); err != nil {
		specsErr = fmt.Errorf("failed to reconcile container specs: %w", err)
	}
	if err := r.processFinalizers(ctx); err != nil {
		finErr = fmt.Errorf("failed to process finalizers: %w", err)
	}
	return ctrl.Result{}, errors.Join(specsErr, finErr)
}

// startWatcher creates and starts a Docker watcher goroutine. The
// watcher inherits watcherCtx, which cancels only on manager
// shutdown; startWatcher deliberately drops Reconcile's ctx so a
// future ReconciliationTimeout or per-request deadline cannot kill
// the watcher the moment Reconcile returns.
func (r *EngineReconciler) startWatcher(_ context.Context) error {
	r.watcherMu.Lock()
	defer r.watcherMu.Unlock()

	if r.watcher != nil {
		return nil
	}

	w, err := newDockerWatcher(r.watcherCtx, r.Client, r.reconcileChan)
	if err != nil {
		return err
	}

The round-3 fix for "watcher bound to Reconcile context" correctly prevents per-request cancellation from killing the watcher, but it also prevents manager-shutdown cancellation from reaching the watcher. The only shutdown path for the watcher is stopWatcher, which is called from Reconcile. On manager shutdown, controller-runtime cancels the manager context, Reconcile stops being invoked, and stopWatcher never runs. The watcher goroutine stays alive — and holds the Docker client connection open — until the process exits moments later.

Practical impact is small: GracefulShutdownTimeout on the manager is 10 s (manager.go:161), process exit follows quickly, and the watcher goroutine dies with the process. But cli.Close() never runs, the watcher's shutdown logs are missing, and any future refactor that wants ordered cleanup (flush pending events, fsync metadata) would have no hook to plug into. Claude rated this Important; we classify as Suggestion given the narrow shutdown-only impact, while noting it regresses the explicit intent of the round-3 fix.

	if scm.hasWebhookControllers() {
		// Use instance TLS directory for webhook certificates (persistent storage)
		webhookCertDir := instance.TLSDir()

		opts := webhook.Options{
			Host:     "127.0.0.1",
			Port:     scm.webhookPort,
			CertDir:  webhookCertDir,
			CertName: fmt.Sprintf("webhook-%s.crt", scm.name),
			KeyName:  fmt.Sprintf("webhook-%s.key", scm.name),
		}

Fix: register a manager.RunnableFunc with the shared manager in RegisterWithManager that listens on the manager context and calls stopWatcher on cancel. Or store the watchCancel alongside watcherMu and cancel it from a mgr.Add(manager.RunnableFunc(...)) runnable. Either pattern gives the watcher a proper shutdown hook while keeping its lifetime independent of any individual Reconcile ctx.

S2. engine.bats has no local_teardown_file — failed tests leak Docker containers/volumes into the next run Claude
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: SUSE LLC
# SPDX-FileCopyrightText: The Rancher Desktop Authors

load '../../helpers/load'

The new suite creates named Docker objects (test-lifecycle, test-state, test-delete, test-inuse, test-unpause, dangling-pin, dangling-tag-test:v1, test-vol, My_Vol_Ume) across many tests. A failure mid-suite leaves them running; subsequent runs error out with container name "test-lifecycle" is already in use. The rdd svc delete in local_setup_file destroys the VM and its Docker state, which is what catches this today — but only because the suite always performs a full VM recreate on every run. Dropping that rdd svc delete (an obvious future optimisation for test speed) would reintroduce the leaks immediately.

Fix: add a local_teardown_file that force-removes the named test objects:

local_teardown_file() {
    docker rm --force test-lifecycle test-state test-delete test-inuse \
        test-unpause dangling-pin 2>/dev/null || true
    docker rmi dangling-tag-test:v1 2>/dev/null || true
    docker volume rm test-vol My_Vol_Ume 2>/dev/null || true
}

The suite is currently self-cleaning only because of the VM-recreate in local_setup_file, which is a load-bearing side effect worth replacing with explicit teardown.

S3. errs := []error{applyErr} pre-seeds the slice with a possibly-nil error Claude

	var images containersv1alpha1.ImageList
	if err := w.k8s.List(ctx, &images, client.InNamespace(apiNamespace)); err != nil {
		return errors.Join(applyErr, err)
	}
	var errs []error
	if applyErr != nil {
		errs = append(errs, applyErr)
	}
	for i := range images.Items {
		img := &images.Items[i]

errors.Join at line 239 discards nils, so the behaviour is correct. The pattern reads as if every call returns at least one error, and diverges from the more common var errs []error; if applyErr != nil { errs = append(errs, applyErr) } shape used elsewhere in the same package. Pure style; change only if you are in the file anyway.

		}
		if keep[img.Name] {
			continue
		}
		if err := w.removeMirrorResource(ctx, img, img.Name); err != nil {
			errs = append(errs, err)
		}
	}
	return errors.Join(errs...)
}
S4. --controllers help text does not list the new containers group Codex
}

// CompletedOptions holds the completed controller configuration options.
type CompletedOptions struct {
	*completedOptions
}

// AddFlags adds the flags for the controller options to the given FlagSet.
func (o *Options) AddFlags(fs *pflag.FlagSet) {
	if o == nil {
		return

The round-3 fix moved engine from the app group to the containers group, but both help strings (options.go:32 and the equivalent line in cmd/rdd/service.go:174) still only document rdd and app. A user who reads --help has no indication that -engine, containers,-engine, or containers are valid selectors — and this PR introduces a containers-mock.bats test that already relies on the containers,-engine form.

	"k8s.io/client-go/util/keyutil"
	cliflag "k8s.io/component-base/cli/flag"
	"k8s.io/klog/v2"
	controlplaneapiserveroptions "k8s.io/kubernetes/pkg/controlplane/apiserver/options"
	"k8s.io/kubernetes/pkg/features"
	kubeoptions "k8s.io/kubernetes/pkg/kubeapiserver/options"
	"k8s.io/kubernetes/pkg/serviceaccount"

	"github.com/rancher-sandbox/rancher-desktop-daemon/pkg/instance"
	rddadmission "github.com/rancher-sandbox/rancher-desktop-daemon/pkg/service/admission"
	"github.com/rancher-sandbox/rancher-desktop-daemon/pkg/service/controllers"
	command := &cobra.Command{
		Use:  "create",
		Long: "Create RDD control plane",
		RunE: serviceCreateAction,
	}

	command.Flags().String("controllers", "*", controllers.ControllersFlagUsage)
	command.Flags().Int("secure-port", 0, "The port on which to serve HTTPS with authentication and authorization (default: 6443 + instance index)")
	if !developer.Mode() {
		_ = command.Flags().MarkHidden("controllers")
		_ = command.Flags().MarkHidden("secure-port")

Fix: update both help strings to list every registered API group, including containers and lima. A structural fix would derive the help text from the registered controllers at startup so it cannot drift again; a minimal fix is to just edit both strings and add a test that every selector used in bats/ appears in the help.


Design Observations

Strengths

Concerns


Testing Assessment

BATS coverage for the new functionality is solid: engine.bats exercises image pull/untag/dangling, container lifecycle, volume mirroring (including RFC-1123-invalid names), bidirectional deletion, spec.state transitions, the non-moby NotApplicable path, and rdd set wait semantics for already-stopped / --timeout=0. The new sync_containers_test.go unit test covers parseContainerName, and discovery_test.go covers the InitDiscovery/MarkControlPlaneReady lifecycle.

Untested scenarios surfaced by the findings above:

  1. Cold-start race on rdd set running=true (I1). engine.bats starts the service against an already-ready control plane or lets startup fully settle before asserting. A test that patches App the moment ReadyAnnotation lands — before registerDiscovery completes — would catch the ordering regression. Given the tight window, a unit-level test on waitForDesiredState against a mocked discovery ConfigMap is more practical.
  2. Non-engine resources in rancher-desktop (I2). No test creates a Container/Image/Volume without the engine finalizer and verifies that engine stop / full-sync leaves it alone. The containers-mock.bats suite has already been modified to exclude engine, which documents the problem instead of guarding against it.
  3. Standalone app-controller with the engine import (I3). No test runs cmd/app-controller with the embedded control plane disabled or with engine explicitly missing, so the misregistration path is unguarded.
  4. Watcher survives manager shutdown without cli.Close (S1). The missing shutdown hook is invisible at the unit-test level because the process exits anyway; an integration test that inspects /proc/self/fd before and after shutdown on Linux would surface the leaked Docker client connection.
  5. engine.bats teardown (S2). A forced-failure test that asserts the next run starts cleanly would catch the leak.

None of the reviewers ran go test this round; all analysis is static. The round-3 review ran go test ./pkg/controllers/app/engine/controllers ./pkg/service/controllers ./pkg/controllers/containers/container ./pkg/controllers/app/app/controllers against a prior revision and all four packages passed; re-running on a5235c8 would be worthwhile before merge.


Documentation Assessment

Design docs remain consistent with the implementation: api_app.md (new ContainerEngineReady condition), api_containers.md (engine mirroring + spec.state + finalizer lifecycle), cmd_app.md (--timeout and wait semantics, with the --controllers=-engine fallback caveat added in round 3), and discovery.md (ready annotation).

Two gaps remain after this round:

  1. --controllers help text in pkg/service/controllers/options.go:32 and cmd/rdd/service.go:174 does not mention the containers group — tracked as S4.
  2. docs/design/discovery.md describes the ready annotation contract but does not mention that EnabledControllers is populated after the annotation. Documenting that ordering would have helped this round catch I1 earlier, and clients that depend on the list should know to wait past just the annotation.

Acknowledged Limitations


Unresolved Feedback


Agent Performance Retro

Claude

Claude caught the cold-start race on rdd set running=true and filed it as S1, with a correct root-cause description but insufficient urgency — the fallback from ContainerEngineReady to Running=True is exactly the regression the PR exists to close. Codex filed the same code path at Important with a tighter file:line trace, and we promoted Claude's S1 into the consolidated I1. Claude's unique contribution was the context.WithoutCancel shutdown leak, a clean catch on a round-3 fix: it identified that the decoupling is too complete and that no manager-level cancellation path exists. Claude withdrew one self-filed finding (I3, stale-discovery cleanup) mid-review after re-reading the code, which is the right behaviour and worth rewarding. Its Coverage Summary was the most complete of the three, flagging per-file "Trivial" vs "Reviewed, no issues". Calibration was mostly sound though Claude's I1 (WithoutCancel) reads closer to Suggestion-severity; we downgraded.

Codex

Codex delivered the three highest-impact structural findings of the round and the only Important findings that survived consolidation unchanged. The ReadyAnnotation-vs-registerDiscovery ordering trace crosses four files (manager.go, set.go, discovery.go, service.go) and gets the causality right on the first try. The "engine deletes unowned objects" finding correctly links the code behaviour to the visible test-suite carve-out in containers-mock.bats, which is the kind of cross-file reasoning that separates a ship-blocking finding from a theoretical one. The app-controller import finding contradicts Gemini on the same code path and is right. Codex did not find the shutdown leak Claude caught. It did not run tests this round (prior rounds did). Calibration was correct on all three Important findings — no reconciliation adjustments needed.

Gemini

Gemini's output this round was net-negative. Its Critical finding (C1, invalid @test syntax) is a hallucination: the engine.bats file uses @test correctly, verified directly on disk. Its Important I1 (API group mismatch with app-controller silently omitting engine) is backwards — pkg/external/main.go does not filter by GetAPIGroup(), and Codex's finding on the same code path has the causality right. Its Important I2 (BATS $(...) substitution) is a re-flag of a standing-declined style issue. We dropped C1 and I1 as false positives and routed I2 to Unresolved Feedback. Gemini skipped git blame again (consistent with its quota-driven behaviour), so it could not distinguish regressions from pre-existing issues — which might be why it spent its analysis budget on hallucination-prone structural claims rather than on tracing the round-3 fixes. This is the second round where Gemini's calibration has been notably worse than the other two agents; worth watching whether the pattern is model-level or this-diff-specific.

Summary

Claude Opus 4.6Codex GPT 5.4Gemini 3.1 Pro
Duration3m 34s7m 26s6m 54s
Findings3S3I 1Snone
Tool calls35 (Read 19, Grep 9, Bash 7)78 (shell 77, plan 1)21 (grepsearch 12, runshellcommand 8, readfile 1)
Design observations543
False positives102
Unique insights130
Files reviewed393939
Coverage misses000
Totals3S3I 1Snone
Downgraded1 (I→S)00
Dropped303

Reconciliation — Claude P1 isEngineControllerEnabled race: suggestion → important I1 (promoted; shared with Codex and classified as a regression-reintroducing bug rather than a soft concern). Claude P1 context.WithoutCancel shutdown leak: important → suggestion S1 (downgraded; narrow shutdown-only impact and process exits soon after). Gemini P1 invalid @test syntax: critical → dropped as false positive (file verified to use @test correctly). Gemini P1 API-group mismatch: important → dropped as false positive (the causality trace contradicts pkg/external/main.go, which does not filter by GetAPIGroup()). Gemini P1 BATS $(...) substitution: important → dropped to Unresolved Feedback (standing-declined in prior rounds). Claude P1 image-name length (S3 in its review): dropped as false positive (DNS-1123 subdomain, 253-char limit applies). Claude P1 waitForDesiredState non-running TODO: dropped (acknowledged limitation, not a finding). Claude P1 stale-discovery cleanup: withdrawn by Claude itself during review.

Overall, Codex carried the round. Claude contributed one valuable unique finding and corroborated the highest-impact regression; Gemini did not usefully improve the output and cost orchestration time on verification. For a round-4 review of a bundle of targeted fixes, two reliable agents is the right spread — Gemini's hallucinations this round are a reminder that agent output must be verified against source before it reaches a report.


Review Process Notes

Skill improvements

Repo context updates