Deep Review: 20260415-184215-pr-316

Date2026-04-15 18:42
Reporancher-sandbox/rancher-desktop-daemon
Round1
Author@jandubois
PR#316 — engine: replace Container.spec.state with action annotations
Branchcontainer-action-annotations
Commitsd4dd09d engine: replace Container.spec.state with action annotations
ReviewersClaude Opus 4.6, Codex GPT 5.4, Gemini 3.1 Pro
VerdictMerge with fixes — invalid-annotation wedge (I2) has a clear reproduction path and needs a guard
Wall-clock time1 h 20 min 55 s


Executive Summary

This PR replaces the level-triggered Container.spec.state field with a one-shot containers.rancherdesktop.io/action annotation, eliminating reconciler thrashing against Docker's restart policy and out-of-band docker start/stop. The annotation model is well-designed: dispatch-before-patch gives crash recovery, the compare-and-swap annotation removal preserves concurrent writes, and collapsing the two-field-owner SSA dance after emptying ContainerSpec materially simplifies sync_containers.go.

Two important issues remain. First, the annotation cleanup patch lacks optimistic concurrency, creating a narrow TOCTOU window where a concurrent action can be silently dropped (I1). Second, the webhook does not validate annotation values, so an invalid action string wedges the reconciler in an infinite retry loop because the CRD enum rejects the status write while the annotation stays in place (I2).

Structure: 15 changed files, +543 −288 lines. API types, generated apply configurations, CRD, webhook, engine controller, mock controller, BATS integration tests, design documentation.


Critical Issues

None.


Important Issues

I1. Annotation removal patch lacks optimistic concurrency Gemini
// pass, so the lastAction write races against it.
func (w *dockerWatcher) patchContainerLastAction(ctx context.Context, id string, lastAction *containersv1alpha1.ContainerLastAction) error {
	key := client.ObjectKey{Name: id, Namespace: w.apiNamespace}
	return retry.RetryOnConflict(retry.DefaultRetry, func() error {
		var latest containersv1alpha1.Container
		if err := w.k8s.Get(ctx, key, &latest); err != nil {
			return err
		}
		patch := client.MergeFromWithOptions(latest.DeepCopy(), client.MergeFromWithOptimisticLock{})
		latest.Status.LastAction = lastAction
		return w.k8s.Status().Patch(ctx, &latest, patch)
	})
}

// removeActionAnnotation clears the AnnotationAction annotation only if its
// value still matches the action we just processed. A concurrent writer may
// have replaced the annotation with a fresh action between dispatch and
// cleanup; in that case the new value must survive so the next reconcile
// picks it up.
func (w *dockerWatcher) removeActionAnnotation(ctx context.Context, id, observed string) error {
	key := client.ObjectKey{Name: id, Namespace: w.apiNamespace}
	return retry.RetryOnConflict(retry.DefaultRetry, func() error {
		var latest containersv1alpha1.Container
		if err := w.k8s.Get(ctx, key, &latest); err != nil {
			return err
		}

removeActionAnnotation reads the latest object, checks that the annotation still matches the processed value (lines 460–462), then sends a merge patch to remove it. client.MergeFrom does not include resourceVersion in the patch, so the API server applies it unconditionally. If a concurrent writer replaces the annotation with a new action between the Get and the Patch, the merge patch removes the new value and the action is silently lost.

The compare-and-swap check on lines 460–462 is correct in intent but does not close the TOCTOU window — it guards against changes visible in the cache at Get time, not changes committed to etcd after the Get.

The same pattern appears in patchContainerLastAction (line 442), though the consequence there is a lost status update rather than a lost action.

// failure instead of a silent success.
func (w *dockerWatcher) containerRunState(ctx context.Context, id string) (running, paused bool, err error) {
	inspect, err := w.cli.ContainerInspect(ctx, id, dockerclient.ContainerInspectOptions{})
	if err != nil {
		return false, false, err
	}
	return inspect.Container.State.Running, inspect.Container.State.Paused, nil
}

// patchContainerLastAction writes status.lastAction with retry-on-conflict.
// The main engine sync writes to the status subresource on every reconcile

Fix: use client.MergeFromWithOptions(latest.DeepCopy(), client.MergeFromWithOptimisticLock{}) to include resourceVersion in the patch. The existing RetryOnConflict loop already handles the resulting 409 correctly.

I2. Invalid action annotation wedges reconciliation Codex
	if !ok {
		return nil
	}

	log := logf.FromContext(ctx).WithName("docker-watcher")
	action := containersv1alpha1.ContainerAction(raw)
	requestedAt := metav1.Now()

	// The webhook rejects invalid action values, but one written while the
	// webhook is offline can still reach storage. Drop such values here;
	// otherwise the CRD enum rejects the status.lastAction write, the
	// annotation stays in place, and every reconcile retries forever.
	if !action.IsValid() {
		log.Info("Dropping invalid container action annotation", "id", c.Name, "action", raw)
		return w.removeActionAnnotation(ctx, c.Name, raw)
	}

	dockerErr := w.dispatchContainerAction(ctx, log, c.Name, action)

	lastAction := &containersv1alpha1.ContainerLastAction{
		Action:      action,
		RequestedAt: requestedAt,
		CompletedAt: metav1.Now(),
	}
	if dockerErr == nil {
		lastAction.State = containersv1alpha1.ContainerActionSucceeded
	} else {
		lastAction.State = containersv1alpha1.ContainerActionFailed
		lastAction.Error = dockerErr.Error()
		log.Info("Container action failed", "id", c.Name, "action", action, "error", dockerErr)
	}

	if err := w.patchContainerLastAction(ctx, c.Name, lastAction); err != nil {
		return fmt.Errorf("failed to patch lastAction for %s: %w", c.Name, err)
	}

The update webhook validates only spec immutability; annotations are free-form. A misspelled annotation value (e.g. action=staart) passes the webhook, reaches dispatchContainerAction (which returns "unknown container action"), and then patchContainerLastAction tries to write the invalid value into status.lastAction.action. The CRD enum (start|stop|pause|unpause|restart) rejects the status write, so patchContainerLastAction fails and removeActionAnnotation never runs. The bad annotation persists and every subsequent reconcile retries the same impossible action.

Fix: validate the annotation value in the update webhook, and add a defensive guard in processContainerAction that removes the annotation and logs a warning for unrecognized values instead of attempting a status write that cannot succeed.


Suggestions

S1. InProgress action state is defined but never written Claude Codex Gemini
	switch a {
	case ContainerActionStart, ContainerActionStop,
		ContainerActionPause, ContainerActionUnpause,
		ContainerActionRestart:
		return true
	}
	return false
}

// +kubebuilder:validation:Enum=Succeeded;Failed

// ContainerActionState describes the outcome of an action that was requested

The reconciler blocks on the Docker call and writes only Succeeded or Failed. No code path ever writes InProgress. The enum value and its CRD entry are dead code.

Fix: remove ContainerActionInProgress from the type, the CRD, and the generated deep-copy until the controller needs it. Writing InProgress before dispatch would require an extra API call per action and break the crash-recovery property (annotation-present = in-progress).

S2. unpause on a stopped container silently records Succeeded Gemini
	case containersv1alpha1.ContainerActionStop:
		log.Info("Stopping container", "id", id)
		_, err := w.cli.ContainerStop(ctx, id, dockerclient.ContainerStopOptions{})
		return err
	case containersv1alpha1.ContainerActionPause:
		_, paused, err := w.containerRunState(ctx, id)
		if err != nil {
			return err
		}
		if paused {
			return nil
		}
		log.Info("Pausing container", "id", id)
		_, err = w.cli.ContainerPause(ctx, id, dockerclient.ContainerPauseOptions{})
		return err
	case containersv1alpha1.ContainerActionUnpause:
		running, paused, err := w.containerRunState(ctx, id)
		if err != nil {

When a user requests unpause on a stopped container, isContainerPaused returns false, dispatchContainerAction returns nil, and processContainerAction records Succeeded. By contrast, pause on a stopped container reaches Docker and correctly records Failed. The asymmetry could confuse GUI developers.

Fix: check inspect.Container.State.Running before the !paused short-circuit and return an error when the container is not running.

S3. Restart-idempotency sentence contradicts itself Claude
| `restart` | `ContainerRestart`  |

A single annotation holds at most one pending action. Writing a new
value replaces the old, so a user who requests `pause` and then
`unpause` before the reconciler has run will see only the unpause. No
queue, no accumulating history.

The engine calls Docker before patching `status.lastAction` and
removing the annotation, so a crash mid-flight leaves the annotation
in place and the next reconcile replays the action. Start, stop,
pause, and unpause are idempotent against a container already in the
target state, so replay is safe. Restart has no target state to match,
so a replay after a crash runs a second restart; the double restart
is harmless.

The clause lists restart as idempotent, then the next clause names restart as the exception. The code comment at docker_watcher.go:374-381 is precise: start/stop are idempotent via Docker's 304, pause/unpause via the reconciler pre-check, and restart may double-restart (harmlessly). The user-facing doc should mirror that.


	if err := w.patchContainerLastAction(ctx, c.Name, lastAction); err != nil {
		return fmt.Errorf("failed to patch lastAction for %s: %w", c.Name, err)
	}
	if err := w.removeActionAnnotation(ctx, c.Name, raw); err != nil {
		return fmt.Errorf("failed to remove action annotation for %s: %w", c.Name, err)
	}
	return nil
}

// dispatchContainerAction executes the Docker call for a single action. The
// caller pre-validates the action name; the default branch triggers only
// when a new ContainerAction value is added to the type but not to the switch.
//
// Pause and unpause pre-check the current Docker state and return nil when
// the container is already in the target state. Two reconcile ticks can read
// the same action annotation through the informer cache before the cache
// sees the removal that follows the first dispatch — the pre-check keeps

Fix: drop restart from the idempotent list and state that a double restart is harmless.


Design Observations

Concerns

Strengths


Testing Assessment

The BATS tests cover the five actions (start, stop, pause, unpause, restart), the failure path (pause on a stopped container), and verify lastAction outcomes. Coverage gaps:

  1. No test covers an invalid annotation value (I2). An invalid-action test would confirm the webhook rejects it and the reconciler does not wedge.
  2. No test covers unpause on a stopped container (S2), which would expose the false-success asymmetry.
  3. The duplicate-informer-delivery scenario described in docker_watcher.go:374-381 is comment-verified only; deterministic BATS coverage would require timing control that is impractical.

Documentation Assessment

docs/design/api_containers.md clearly explains the rationale for the annotation model, the crash-recovery trade-off for restart, and the lastAction contract. The documentation mentions only Succeeded and Failed as terminal states, consistent with the implementation — but the CRD schema still exposes InProgress (S1).


Acknowledged Limitations


Agent Performance Retro

Claude

Claude's initial run produced no output (cause unknown); a rerun succeeded. It missed both important issues (I1, I2) but contributed the review's most architecturally grounded observation: the non-SSA merge-patch shape of patchContainerLastAction is what keeps applyContainer's status apply from pruning lastAction. Codex and Gemini both noted the simplification in sync_containers.go but neither traced why the new write had to be a merge patch rather than a status apply. Claude also raised a precise documentation nit (S3, the self-contradicting restart sentence) that the other two missed.

Codex

Codex found the most impactful issue in the review: the invalid-annotation wedge (I2), which no other agent raised. Its analysis traced the full path from webhook gap through CRD enum rejection to infinite retry, with a concrete fix. It correctly identified InProgress as dead code (S1). It did not flag the optimistic-concurrency gap in removeActionAnnotation (I1).

Gemini

Gemini identified the annotation-removal race condition (I1, filed as critical) and the unpause-on-stopped asymmetry (S2, filed as important). Both are real issues. Gemini's C1 severity was too high — the TOCTOU window is narrow and the intended caller (GUI) serializes actions naturally — so it was downgraded to important. Gemini's I1 was downgraded to suggestion for the same practical-impact reason. Gemini also caught S1 (InProgress dead code), matching Codex.

Summary

Claude Opus 4.6Codex GPT 5.4Gemini 3.1 Pro
Duration14m 04s5m 51s3m 24s
Findings2S1I 1S1I 1S
Tool calls36 (Read 17, Grep 14, Bash 4)41 (shell 40, stdin 1)3 (readfile 2, runshell_command 1)
Design observations323
False positives000
Unique insights211
Files reviewed151515
Coverage misses000
Totals2S1I 1S1I 1S
Downgraded002 (I→S)
Dropped000

Reconciliation: Gemini C1 (annotation removal race): critical → important I1, because the TOCTOU window is narrow and the GUI serializes actions naturally. Gemini I1 (unpause on stopped): important → suggestion S2, because the scenario is unusual and does not cause data loss or crashes.


Review Process Notes

No suggestions.


Resolution

Addressed2026-04-16
Commit6251b90 engine: replace Container.spec.state with action annotations (fixup-squashed into the original)
#FindingAction
1Important #1: Annotation removal patch lacks optimistic concurrencyFixed
2Important #2: Invalid action annotation wedges reconciliationFixed
3Suggestion #1: InProgress action state is defined but never writtenFixed
4Suggestion #2: unpause on a stopped container silently records SucceededFixed
5Suggestion #3: Restart-idempotency sentence contradicts itselfFixed
6Testing Gap #1: No test for invalid annotation valueTest written
7Testing Gap #2: No test for unpause on a stopped containerTest written
8Testing Gap #3: Duplicate-informer-delivery scenario is comment-verified onlySkipped