Deep Review: 20260416-120302-pr-316

Date	2026-04-16 12:44
Repo	rancher-sandbox/rancher-desktop-daemon
Round	2
Author	@jandubois
PR	#316 — engine: replace Container.spec.state with action annotations
Commits	`6251b90` engine: replace Container.spec.state with action annotations
Reviewers	Claude Opus 4.6, Codex GPT 5.4, Gemini 3.1 Pro
Verdict	Merge with fixes — design is sound; I1 and I2 are genuine gaps worth addressing before merge.
Wall-clock time	`44 min 54 s`

Executive Summary ¶

The PR replaces Container.spec.state (a level-triggered desired-state field) with a one-shot containers.rancherdesktop.io/action annotation and a new status.lastAction outcome record. The motivation — Docker's restart policy and direct docker start/stop are concurrent writers that a level-trigger mirror cannot sanely own — is correct, and the resulting shape (empty spec, edge-triggered intent, persistent outcome) is a clean fit. The SSA simplification (dropping the engine-controller-finalizer field owner) falls out naturally once the spec is empty, and the pause/unpause pre-checks address the documented informer-cache replay hazard.

Three important gaps remain on the edges: the admission webhook validates the action annotation only on Update, leaving Create unguarded; patchContainerLastAction and removeActionAnnotation both surface NotFound as an error on self-healing races; and the Get-Patch sequence between patchContainerLastAction and removeActionAnnotation can race the informer cache, producing 409 loops that replay the Docker action. None is critical, but all three are worth fixing before merge.

Critical Issues ¶

None.

Important Issues ¶

I1. Action-annotation validation absent on Create Claude Codex¶

pkg/controllers/containers/container/container_controller.go

		Recorder: mgr.GetEventRecorder(ControllerName + "-controller"),
	}).SetupWithManager(mgr)
}

// set up the container controller with a webhook which prevents all modification.
func (c *controller) setupWebhookWithRuntimeConfig(mgr ctrl.Manager) error {
	mgr.GetLogger().Info("Setting up container webhook")
	mutatingConfig := base.WebhookConfig[*v1alpha1.Container]{
		Name:        "container-mutating",
		WebhookName: "container-mutating.containers.rancherdesktop.io",
		WebhookPort: c.webhookPort,
		Operations: []admissionregistrationv1.OperationType{
			admissionregistrationv1.Create,
			admissionregistrationv1.Update,
		},
		Validator: &immutableValidator{},
	}

	managers, err := base.SetupWebhookForResource(mgr, &v1alpha1.Container{}, mutatingConfig)
	if err != nil {
		return err

// container_webhooks.go
func (c *immutableValidator) ValidateCreate(context.Context, *v1alpha1.Container) (warnings ctrlwebhookadmission.Warnings, err error) {
    return nil, errors.New("webhook does not implement create")
}
// ValidateUpdate
if raw, ok := newContainer.Annotations[v1alpha1.AnnotationAction]; ok {
    if !v1alpha1.ContainerAction(raw).IsValid() {
        return nil, fmt.Errorf("invalid %s annotation %q", v1alpha1.AnnotationAction, raw)
    }
}

The webhook registers only for Update, so ValidateUpdate is the only enforcer of the action enum. A client that POSTs a hand-written Container with metadata.annotations["containers.rancherdesktop.io/action"]=bogus passes admission because ValidateCreate is never invoked. .spec is now optional (container_types.go:233-238), and the CRD no longer requires spec at the top level (crd.yaml:270-275, crd.yaml:493-500), so the barrier to a hand-written Container is lower than before the PR. The reconciler's !action.IsValid() fallback at docker_watcher.go:350 catches invalid values, but a valid action on a hand-written mirror would be dispatched against whatever container ID was put in metadata.name, bypassing the documented ContainerCreateRequest flow. The new BATS coverage at engine-docker.bats:366-387 only exercises the Update-reject path.

pkg/apis/containers/v1alpha1/container_types.go


	// Metadata is a standard object metadata
	//
	// +optional
	metav1.ObjectMeta `json:"metadata,omitempty,omitzero"`

	// Spec is reserved for future use. The Container API has no
	// desired-state fields today: actions are requested via the
	// AnnotationAction annotation on metadata instead.
	//
	// +optional
	Spec ContainerSpec `json:"spec,omitempty,omitzero"`

	// Status defines the observed state of Container
	//
	// +optional

pkg/controllers/app/engine/controllers/docker_watcher.go

// processContainerAction handles a container carrying the AnnotationAction
// annotation. It dispatches the Docker call, records the outcome in
// status.lastAction, and then removes the annotation.
//
// The Docker call runs before the status and metadata patches so that a
// crash mid-flight leaves the annotation in place and the next reconcile
// replays the action. Start, stop, pause, and unpause are idempotent
// against a container already in the target state, so replay is safe.
// Restart has no target state to match: a replay sends SIGTERM and waits
// the grace period a second time, which the controller cannot distinguish
// from a deliberate re-request.

bats/tests/32-app-controllers/engine-docker.bats

}

@test "invalid action annotation is rejected by the webhook" {
    # An unknown action value must fail admission so the reconciler never
    # sees a value it cannot process. Without this gate, a bad value would
    # fail to persist in status.lastAction (the CRD enum rejects it) and
    # the annotation would stay in place, wedging the retry loop.
    run_e -0 docker run -d --name test-invalid-action busybox sleep inf
    cid=${output}

    rdd ctl wait --for=jsonpath='{.status.status}'=running \
        --namespace="${RDD_NAMESPACE}" container/"${cid}" --timeout=30s

    run -1 rdd ctl annotate container "${cid}" --namespace="${RDD_NAMESPACE}" --overwrite \
        "containers.rancherdesktop.io/action=bogus"
    assert_output --partial "invalid"

    # The rejected request leaves status.lastAction untouched.
    run -0 rdd ctl get container "${cid}" --namespace="${RDD_NAMESPACE}" \
        -o jsonpath='{.status.lastAction}'
    refute_output

    docker rm --force test-invalid-action
}

@test "action annotation on create is rejected by the webhook" {
    # The engine watcher creates Container mirrors and never sets the
    # action annotation. Reject any hand-written create that carries one,
    # so it cannot drive a Docker action against its metadata.name.
    run -1 rdd ctl apply -f - <<EOF
apiVersion: containers.rancherdesktop.io/v1alpha1
kind: Container

Fix: register the validator for Create as well and implement ValidateCreate symmetrically with ValidateUpdate. If direct Container creates should be rejected outright, keep the current errors.New("webhook does not implement create") body and gate the engine reconciler's own apply path (it uses client.ForceOwnership with engine-controller as field owner, which bypasses admission for controller-initiated applies only if configured that way). Add a BATS test that rdd ctl apply -f of a hand-written Container with a bogus action annotation fails admission on create.

I2. patchContainerLastAction and removeActionAnnotation surface NotFound as an error Claude¶

pkg/controllers/app/engine/controllers/docker_watcher.go

			return fmt.Errorf("container %s is not running", id)
		}
		if !paused {
			return nil
		}
		log.Info("Unpausing container", "id", id)
		_, err = w.cli.ContainerUnpause(ctx, id, dockerclient.ContainerUnpauseOptions{})
		return err
	case containersv1alpha1.ContainerActionRestart:
		log.Info("Restarting container", "id", id)
		_, err := w.cli.ContainerRestart(ctx, id, dockerclient.ContainerRestartOptions{})
		return err
	}
	return fmt.Errorf("unknown container action %q", action)
}

// containerRunState reports whether Docker currently shows the container as
// running and as paused. Pause uses paused to skip a no-op dispatch; unpause
// also checks running so that unpause on a stopped container reports a
// failure instead of a silent success.
func (w *dockerWatcher) containerRunState(ctx context.Context, id string) (running, paused bool, err error) {
	inspect, err := w.cli.ContainerInspect(ctx, id, dockerclient.ContainerInspectOptions{})

If the Container mirror is deleted (by cleanupMirrorResources, a concurrent docker rm destroy event, or a user delete) between dispatchContainerAction and these helpers, the Get returns NotFound. retry.RetryOnConflict does not retry on NotFound, so the error propagates up through processContainerAction (line 371) into the joined error at reconcileContainerActions:374. The controller then requeues and logs "failed to patch lastAction: NotFound" for a container that no longer exists. The same log message fires for genuine conflict-exhaustion failures, so the noise misdirects a debugger.

pkg/controllers/app/engine/controllers/docker_watcher.go

	// The webhook rejects invalid action values, but one written while the
	// webhook is offline can still reach storage. Drop such values here;
	// otherwise the CRD enum rejects the status.lastAction write, the
	// annotation stays in place, and every reconcile retries forever.
	if !action.IsValid() {
		log.Info("Dropping invalid container action annotation", "id", c.Name, "action", raw)
		return w.removeActionAnnotation(ctx, c, raw)
	}

	dockerErr := w.dispatchContainerAction(ctx, log, c.Name, action)

Fix: treat NotFound as success on both the Get and the Patch, in both patchContainerLastAction and removeActionAnnotation.

 func (w *dockerWatcher) patchContainerLastAction(ctx context.Context, id string, lastAction *containersv1alpha1.ContainerLastAction) error {
     key := client.ObjectKey{Name: id, Namespace: w.apiNamespace}
     return retry.RetryOnConflict(retry.DefaultRetry, func() error {
         var latest containersv1alpha1.Container
         if err := w.k8s.Get(ctx, key, &latest); err != nil {
-            return err
+            return client.IgnoreNotFound(err)
         }
         patch := client.MergeFromWithOptions(latest.DeepCopy(), client.MergeFromWithOptimisticLock{})
         latest.Status.LastAction = lastAction
-        return w.k8s.Status().Patch(ctx, &latest, patch)
+        return client.IgnoreNotFound(w.k8s.Status().Patch(ctx, &latest, patch))
     })
 }

I3. Status-patch-then-annotation-patch races the informer cache Gemini¶

pkg/controllers/app/engine/controllers/docker_watcher.go


	// The webhook rejects invalid action values, but one written while the
	// webhook is offline can still reach storage. Drop such values here;
	// otherwise the CRD enum rejects the status.lastAction write, the
	// annotation stays in place, and every reconcile retries forever.
	if !action.IsValid() {
		log.Info("Dropping invalid container action annotation", "id", c.Name, "action", raw)
		return w.removeActionAnnotation(ctx, c, raw)
	}

	dockerErr := w.dispatchContainerAction(ctx, log, c.Name, action)

	lastAction := &containersv1alpha1.ContainerLastAction{
		Action:      action,
		ObservedAt:  observedAt,
		CompletedAt: metav1.Now(),

patchContainerLastAction writes the status subresource, which bumps the object's resourceVersion. removeActionAnnotation then re-Gets via the cached client and patches with client.MergeFromWithOptimisticLock{}, which attaches the Got resourceVersion to the payload. Between the two writes, the informer has not yet received the watch event for the status bump, so the Get returns the stale pre-status version; the subsequent patch then 409s. retry.RetryOnConflict re-runs the closure but re-reads the same cache, so successive attempts keep seeing stale data until the informer catches up. If the cache does not sync within retry.DefaultRetry's ~310ms budget, removeActionAnnotation returns the error, processContainerAction propagates it, and the next reconcile replays the whole Docker action because the annotation is still present. For start/stop/pause/unpause the replay is a no-op (Docker returns 304 or the pre-check skips); for restart the replay is a real double-restart.

Fix: thread the fresh Container object through from patchContainerLastAction into removeActionAnnotation. The status Patch updates the passed-in object in place with the new resourceVersion, so removeActionAnnotation can skip its initial Get on the happy path and patch against the known-fresh version.

-func (w *dockerWatcher) patchContainerLastAction(ctx context.Context, id string, lastAction *containersv1alpha1.ContainerLastAction) error {
+func (w *dockerWatcher) patchContainerLastAction(ctx context.Context, id string, lastAction *containersv1alpha1.ContainerLastAction) (*containersv1alpha1.Container, error) {
     key := client.ObjectKey{Name: id, Namespace: w.apiNamespace}
+    var latest containersv1alpha1.Container
-    return retry.RetryOnConflict(retry.DefaultRetry, func() error {
-        var latest containersv1alpha1.Container
+    err := retry.RetryOnConflict(retry.DefaultRetry, func() error {
         if err := w.k8s.Get(ctx, key, &latest); err != nil {
             return err
         }
         patch := client.MergeFromWithOptions(latest.DeepCopy(), client.MergeFromWithOptimisticLock{})
         latest.Status.LastAction = lastAction
         return w.k8s.Status().Patch(ctx, &latest, patch)
     })
+    return &latest, err
 }

Note: preserving the current observed != current check in removeActionAnnotation is important for the concurrent-rewrite path (I1 design strength in Claude's and Gemini's observations). Threading the fresh object through removes the cache-race on the common path without losing that guard; if the threaded object shows the annotation still equal to observed, the Patch uses its resourceVersion.

Suggestions ¶

S1. Same-value action rewrites are indistinguishable from cleanup Codex Claude¶

pkg/controllers/app/engine/controllers/docker_watcher.go

	return fmt.Errorf("unknown container action %q", action)
}

// containerRunState reports whether Docker currently shows the container as
// running and as paused. Pause uses paused to skip a no-op dispatch; unpause
// also checks running so that unpause on a stopped container reports a
// failure instead of a silent success.
func (w *dockerWatcher) containerRunState(ctx context.Context, id string) (running, paused bool, err error) {
	inspect, err := w.cli.ContainerInspect(ctx, id, dockerclient.ContainerInspectOptions{})
	if err != nil {
		return false, false, err
	}
	return inspect.Container.State.Running, inspect.Container.State.Paused, nil
}

// patchContainerLastAction writes status.lastAction with retry-on-conflict
// and returns the updated Container. The main engine sync writes the
// status subresource on every reconcile, so this write races against it.
// If the mirror is deleted concurrently, any step may return NotFound;
// the caller treats a nil Container as "nothing left to do".
func (w *dockerWatcher) patchContainerLastAction(ctx context.Context, id string, lastAction *containersv1alpha1.ContainerLastAction) (*containersv1alpha1.Container, error) {
	key := client.ObjectKey{Name: id, Namespace: w.apiNamespace}
	var result *containersv1alpha1.Container
	err := retry.RetryOnConflict(retry.DefaultRetry, func() error {
		var latest containersv1alpha1.Container
		if err := w.k8s.Get(ctx, key, &latest); err != nil {
			if apierrors.IsNotFound(err) {
				return nil
			}

removeActionAnnotation distinguishes concurrent writes by comparing the annotation value to the value it dispatched. A user who writes restart twice in quick succession (e.g., a deliberate re-trigger after the first completes) has the second write silently consumed because current == observed. For idempotent actions (start/stop/pause/unpause) the observable state is unchanged and this is fine. For restart, re-hitting the "Restart" button does not produce a second restart. The design is intentionally "latest wins, no queue" (api_containers.md:237-240), so this is a documentation opportunity rather than a code bug.

docs/design/api_containers.md

| --------- | ------------------- |
| `start`   | `ContainerStart`    |
| `stop`    | `ContainerStop`     |
| `pause`   | `ContainerPause`    |
| `unpause` | `ContainerUnpause`  |
| `restart` | `ContainerRestart`  |

A single annotation holds at most one pending action. Writing a new
value replaces the old, so a user who requests `pause` and then
`unpause` before the reconciler has run will see only the unpause. No
queue, no accumulating history.

The engine calls Docker before patching `status.lastAction` and
removing the annotation, so a crash mid-flight leaves the annotation

Fix: add a sentence to api_containers.md under "Change container state" noting that a repeated same-value action is collapsed — the GUI needs to rewrite to a different intermediate value (or let the first complete) to force a second dispatch. Alternatively, encode a request nonce in the annotation payload (e.g., restart/abc123) and compare by id instead of by action name.

S2. LastAction.RequestedAt name doesn't match what it records Claude¶

pkg/apis/containers/v1alpha1/container_types.go

	State ContainerActionState `json:"state"`
	// Error is the error message if the action failed.
	//
	// +optional
	Error string `json:"error,omitempty"`
	// ObservedAt is when the reconciler observed the action annotation.
	// Backlog may delay this relative to the user's write time.
	//
	// +optional
	ObservedAt metav1.Time `json:"observedAt,omitempty,omitzero"`
	// CompletedAt is when the action completed, regardless of outcome.
	//
	// +optional
	CompletedAt metav1.Time `json:"completedAt,omitempty,omitzero"`

requestedAt := metav1.Now()

The JSON key reads as "when the UI requested the action"; the value is actually "when the reconciler first observed the annotation". Under load the two can differ by seconds. Since the CRD has not shipped, renaming is cheap.

Fix: rename to observedAt (field, JSON tag, apply configuration, CRD enum, doc examples), or keep the name and sharpen the docstring: "the time the reconciler first observed the action annotation; backlog may delay it relative to the user's write time."

S3. The !action.IsValid() reconciler fallback has no test Claude¶

pkg/controllers/app/engine/controllers/docker_watcher.go

	}
	return client.IgnoreNotFound(w.k8s.Delete(ctx, latest))
}

// processContainerAction handles a container carrying the AnnotationAction
// annotation. It dispatches the Docker call, records the outcome in
// status.lastAction, and then removes the annotation.
//
// The Docker call runs before the status and metadata patches so that a
// crash mid-flight leaves the annotation in place and the next reconcile
// replays the action. Start, stop, pause, and unpause are idempotent
// against a container already in the target state, so replay is safe.
// Restart has no target state to match: a replay sends SIGTERM and waits
// the grace period a second time, which the controller cannot distinguish
// from a deliberate re-request.
func (w *dockerWatcher) processContainerAction(ctx context.Context, c *containersv1alpha1.Container) error {
	raw, ok := c.Annotations[containersv1alpha1.AnnotationAction]
	if !ok {

The comment describes a real failure mode that BATS cannot easily exercise (webhook offline). A unit test against a fake client that constructs a Container with annotations[AnnotationAction]="bogus", calls processContainerAction, and asserts the annotation was removed and lastAction was not written would close the gap without an integration tail. If I1 is applied, the exposure shrinks but doesn't disappear — the comment specifically targets the webhook-offline case.

S4. Restart replay after crash is more disruptive than "harmless" Claude¶

docs/design/api_containers.md

| `restart` | `ContainerRestart`  |

A single annotation holds at most one pending action. Writing a new
value replaces the old, so a user who requests `pause` and then
`unpause` before the reconciler has run will see only the unpause. No
queue, no accumulating history.

The engine calls Docker before patching `status.lastAction` and
removing the annotation, so a crash mid-flight leaves the annotation
in place and the next reconcile replays the action. Start, stop,
pause, and unpause are idempotent against a container already in the
target state, so replay is safe. Restart has no target state to match:
a replay sends SIGTERM and waits the grace period a second time, which
the controller cannot distinguish from a deliberate re-request.

If the Docker call fails (for example, `pause` on a container that is
not running), the reconciler still removes the annotation and records

# api_containers.md:246-248
# Restart has no target state to match,
# so a replay after a crash runs a second restart; the double restart
# is harmless.

"Harmless" understates it. ContainerRestart sends SIGTERM and waits the container's grace period; a double-restart during graceful shutdown can interrupt a long-running workload (database drain, mid-request HTTP handler) twice in quick succession.

Fix: reword as "a second restart interrupts whatever the container was doing; the controller cannot distinguish a crash-replay from a deliberate second request."

S5. AnnotationAction is a package-level var Claude¶

pkg/apis/containers/v1alpha1/container_types.go


// AnnotationAction requests a one-shot action on a container. The reconciler
// performs the action, records the outcome in status.lastAction, and removes
// the annotation. Setting a new value replaces any pending action — there is
// no queue.
const AnnotationAction = "containers.rancherdesktop.io/action"

// +kubebuilder:validation:Enum=start;stop;pause;unpause;restart

// ContainerAction is the name of an action that can be requested on a
// container via the AnnotationAction annotation.

SchemeGroupVersion.Group is a struct field, so this expression cannot be a Go const. Any package can reassign containersv1alpha1.AnnotationAction at runtime. Kubernetes uses const for similar annotation keys and spells the group out literally (kubectl.kubernetes.io/last-applied-configuration). Matching that idiom trades one line of duplicated group name for compile-time immutability.

Fix:

-var AnnotationAction = SchemeGroupVersion.Group + "/action"
+const AnnotationAction = "containers.rancherdesktop.io/action"

S6. Two rdd ctl get calls per assert_last_action Claude¶

bats/tests/32-app-controllers/engine-docker.bats

    rdd ctl annotate container "${cid}" --namespace="${RDD_NAMESPACE}" --overwrite \
        "containers.rancherdesktop.io/action=${action}"
    try --max 30 --delay 1 -- assert_action_consumed "${cid}"
}

assert_last_action() {
    local cid=$1 action=$2 state=$3
    run -0 rdd ctl get container "${cid}" --namespace="${RDD_NAMESPACE}" \
        -o jsonpath='{.status.lastAction.action}={.status.lastAction.state}'
    assert_output "${action}=${state}"
}

@test "stop action stops a running container" {
    run_e -0 docker run -d --name test-state busybox sleep inf
    cid=${output}

    rdd ctl wait --for=jsonpath='{.status.status}'=running \
        --namespace="${RDD_NAMESPACE}" container/"${cid}" --timeout=30s

Two roundtrips and two snapshots of the object. In practice the tests don't race their own assertions, but combining both reads into one jsonpath halves the API traffic and asserts both fields against the same object version:

 assert_last_action() {
     local cid=$1 action=$2 state=$3
     run -0 rdd ctl get container "${cid}" --namespace="${RDD_NAMESPACE}" \
-        -o jsonpath='{.status.lastAction.action}'
-    assert_output "${action}"
-    run -0 rdd ctl get container "${cid}" --namespace="${RDD_NAMESPACE}" \
-        -o jsonpath='{.status.lastAction.state}'
-    assert_output "${state}"
+        -o jsonpath='{.status.lastAction.action}={.status.lastAction.state}'
+    assert_output "${action}=${state}"
 }

Design Observations ¶

Strengths ¶

Implicit "in-progress" via annotation presence (in-scope) Gemini — Using the presence of the annotation as the implied "in-progress" indicator avoids a third, latency-inducing API write (e.g., state=InProgress) before dispatching the Docker call. Clean signal for the GUI.
Concurrent-rewrite safety via value-match (in-scope) Claude Gemini — removeActionAnnotation's current != observed check correctly preserves a concurrent rewrite to a different action (e.g., a user switching pause to stop mid-flight). This is a small amount of code doing a lot of work.
SSA simplification follows naturally (in-scope) Codex Claude — Collapsing the two SSA field managers (engine-controller and engine-controller-finalizer) into one is a direct consequence of spec becoming empty. The old dance existed only because engine-controller owned spec.state. Nothing to own, nothing to dance around.
Asymmetric pre-check is correct (in-scope) Claude Gemini — Pre-checking Docker state for pause/unpause but not start/stop is intentionally asymmetric. Docker returns 304 on redundant start/stop but silently succeeds on unpause against a non-paused container; the pre-check turns that silent success into a documented failure. The explanatory comment at docker_watcher.go:383-390 makes the asymmetry legible.
Status-only mirror is the right abstraction (in-scope) Codex — Abandoning the level-triggered spec.state in favor of a status-only mirror with edge-triggered intent is a better fit for a resource whose authoritative state lives outside K8s.

Concerns ¶

Reconciler full-lists the container list on every tick (future) Claude — reconcileContainerActions Lists all containers in the namespace on every reconcile, and every Container/Image/Volume watch event triggers a reconcile. With N containers and M events/second, cost is O(N·M). The comment at engine_reconciler.go:186-195 already captures this as the long-term plan (per-resource reconcilers with watch predicates); this PR's new flow doubles the per-reconcile list count (reconcileContainerActions + processFinalizers). Noted but explicitly deferred.
One-shot annotation drops same-value re-requests (future) Codex Claude — See S1 for the same observation framed as a documentation suggestion. For idempotent actions this is fine; for restart it means a deliberate re-trigger is silently collapsed. Design is intentional ("latest wins, no queue"); a doc note would close the loop.

Testing Assessment ¶

The five new action tests (engine-docker.bats:276-408) cover happy paths (start, stop, pause+unpause toggle, restart) plus two failure paths (pause-on-stopped, unpause-on-stopped) and webhook-reject for invalid action values. The pre-check logic on pause/unpause is exercised. Gaps, ranked by risk:

!action.IsValid() reconciler fallback (S3). No way to hit it via BATS unless the webhook is disabled. A unit test against a fake client would close this.
Webhook rejection on Create (conditional on I1 being applied).
Concurrent rewrite during dispatch. A BATS test that annotates action=stop, waits briefly, annotates action=start, and asserts the final state is running would exercise the observed != current skip path.
Action while mirror is mid-deletion. Exercises the DeletionTimestamp != nil skip in reconcileContainerActions:367 and the NotFound handling I2 targets.
Status-patch-then-annotation-patch race (I3). Hard to trigger deterministically in BATS; a unit test with a lagging fake client could force the stale-cache window.

Documentation Assessment ¶

docs/design/api_containers.md is rewritten cleanly around the action model with both Succeeded and Failed lastAction examples. Specific gaps:

The doc at api_containers.md:155-159 shows a Succeeded lastAction without an error field; the Failed example at 255-262 shows one. A one-liner — "error is set only when state: Failed" — would remove the ambiguity for skimmers.
api_containers.md:246-248 "harmless double-restart" wording is addressed by S4.
Missing note on lastAction persistence semantics: it survives observable state changes (docker restart of the underlying container, docker stop/start, etc.) because sync doesn't declare ownership of lastAction. Worth one sentence so GUI implementers don't assume lastAction is cleared on state transitions.

Commit Structure ¶

One commit, 6251b90 engine: replace Container.spec.state with action annotations. The message accurately describes the change, motivates it, and covers the edge cases (crash recovery, duplicate-dispatch pre-check, webhook-offline drop). Scope is a single conceptual unit — API type change, reconciler change, webhook change, CRD regeneration, design doc, tests. Stacked on #309 and a prerequisite for #315 per the PR description.

Acknowledged Limitations ¶

One-annotation, no queue — PR description and api_containers.md:237-240. A new write replaces the pending action; same-value re-requests collapse. This is intentional and is the reason S1 asks for a doc note rather than code changes.
Restart double-execution on crash — docker_watcher.go:333-335 and api_containers.md:246-248. Addressed by S4 (wording) rather than code (pre-check with StartedAt/CompletedAt window added complexity for a rare scenario).
Per-reconcile list cost — engine_reconciler.go:186-195. Long-term plan is per-resource reconcilers with watch predicates; this PR keeps the same tradeoff.
No CLI support for actions — author comment on the PR. The action API is GUI-facing by design; CLI users are directed to docker/nerdctl directly. Documented in api_containers.md:264-266.

Agent Performance Retro ¶

Claude ¶

Caught both I1 (webhook on Create) and I2 (NotFound propagation) plus a dense suggestion list focused on polish (naming, const vs var, test-helper consolidation, doc wording). The I2 analysis is Claude-only: the other two agents looked at the same helpers but did not notice the NotFound path. S4's "restart double-execution is more disruptive than harmless" is also uniquely Claude's — it pushed back on the author's own phrasing. Balanced output: nothing felt inflated, and the suggestions are all actionable.

Codex ¶

Converged with Claude on I1 — same root cause (validator registered for Update only), different framing. Codex additionally raised S1 (same-value rewrites collapse), which Claude surfaced as a design concern rather than a numbered finding; framing as a numbered suggestion makes it easier to act on. Coverage was clean, and the pass had no false positives. Shorter output than Claude, but what's there is correct.

Gemini ¶

Initial launch hit a 429 MODEL_CAPACITY_EXHAUSTED retry loop for ~10 minutes on the preview model; killed and relaunched after the user updated gemini-cli. The second run produced a single, sharp finding: I3 (status-patch-then-annotation-patch cache race). This was missed by both Claude and Codex despite them both reading the same helpers. The fix Gemini proposed — thread the fresh object through — is the right shape, though it needs a small refinement to preserve the concurrent-rewrite guard on the current != observed path. Gemini marked container_webhooks.go as "Reviewed, no issues", missing I1; a coverage gap.

Summary ¶

	Claude Opus 4.6	Codex GPT 5.4	Gemini 3.1 Pro
Duration	13m 58s	5m 06s	6m 52s
Findings	2I 5S	1I 1S	1I
Tool calls	49 (Grep 21, Read 14, Bash 9)	43 (shell 42, stdin 1)	28 (runshellcommand 16, grepsearch 6, writefile 4)
Design observations	6	2	3
False positives	0	0	0
Unique insights	3	1	1
Files reviewed	15	15	15
Coverage misses	0	0	1
Totals	2I 5S	1I 1S	1I
Downgraded	0	0	0
Dropped	0	0	0

Overall, each agent contributed exactly one independent finding that the others missed (Claude: I2; Codex: S1; Gemini: I3). Claude was the highest-volume and most polished; Codex was the most concise with the cleanest coverage; Gemini's single catch (I3) was the subtlest of the three and would likely have been missed entirely without it.

Review Process Notes ¶

None for this review. The existing prompts did their job; the findings that emerged were all naturally derived from the stated dimensions (admission symmetry, subresource-patch races, error propagation, naming). No skill-level generalisation suggests itself without stretching into specifics of this PR, and deep-review-context.md was not contradicted by anything under review.

Resolution ¶


Addressed	2026-04-16
Commit	`397d40a engine: replace Container.spec.state with action annotations` (fixup-squashed into the original)

#	Finding	Action
1	Important #1: Action-annotation validation absent on Create	Fixed
2	Important #2: patchContainerLastAction/removeActionAnnotation surface NotFound as error	Fixed
3	Important #3: Status-patch-then-annotation-patch races the informer cache	Fixed
4	Suggestion #1: Same-value action rewrites indistinguishable from cleanup	Skipped
5	Suggestion #2: LastAction.RequestedAt name doesn't match what it records	Fixed (renamed to observedAt)
6	Suggestion #3: !action.IsValid() reconciler fallback has no test	Skipped
7	Suggestion #4: Restart replay after crash more disruptive than "harmless"	Fixed (reworded doc and comment)
8	Suggestion #5: AnnotationAction is a package-level var	Fixed (const with literal group)
9	Suggestion #6: Two rdd ctl get calls per assertlastaction	Fixed (single jsonpath)
10	Testing Gap #1: No test for !action.IsValid() fallback	Skipped
11	Testing Gap #2: Webhook rejection on Create untested	Test written
12	Testing Gap #3: Concurrent rewrite during dispatch untested	Skipped
13	Testing Gap #4: Action while mirror is mid-deletion untested	Skipped
14	Testing Gap #5: Status-patch-then-annotation-patch race untested	Skipped
15	Documentation Gap #1: error field set only when state: Failed	Documentation updated
16	Documentation Gap #2: lastAction persistence semantics undocumented	Documentation updated