Deep Review: 20260415-184215-pr-316
| Date | 2026-04-15 18:42 |
| Repo | rancher-sandbox/rancher-desktop-daemon |
| Round | 1 |
| Author | @jandubois |
| PR | #316 — engine: replace Container.spec.state with action annotations |
| Branch | container-action-annotations |
| Commits | d4dd09d engine: replace Container.spec.state with action annotations |
| Reviewers | Claude Opus 4.6, Codex GPT 5.4, Gemini 3.1 Pro |
| Verdict | Merge with fixes — invalid-annotation wedge (I2) has a clear reproduction path and needs a guard |
| Wall-clock time | 1 h 20 min 55 s |
Executive Summary ¶
This PR replaces the level-triggered Container.spec.state field with a one-shot containers.rancherdesktop.io/action annotation, eliminating reconciler thrashing against Docker's restart policy and out-of-band docker start/stop. The annotation model is well-designed: dispatch-before-patch gives crash recovery, the compare-and-swap annotation removal preserves concurrent writes, and collapsing the two-field-owner SSA dance after emptying ContainerSpec materially simplifies sync_containers.go.
Two important issues remain. First, the annotation cleanup patch lacks optimistic concurrency, creating a narrow TOCTOU window where a concurrent action can be silently dropped (I1). Second, the webhook does not validate annotation values, so an invalid action string wedges the reconciler in an infinite retry loop because the CRD enum rejects the status write while the annotation stays in place (I2).
Structure: 15 changed files, +543 −288 lines. API types, generated apply configurations, CRD, webhook, engine controller, mock controller, BATS integration tests, design documentation.
Critical Issues ¶
None.
Important Issues ¶
// pass, so the lastAction write races against it.
func (w *dockerWatcher) patchContainerLastAction(ctx context.Context, id string, lastAction *containersv1alpha1.ContainerLastAction) error {
key := client.ObjectKey{Name: id, Namespace: w.apiNamespace}
return retry.RetryOnConflict(retry.DefaultRetry, func() error {
var latest containersv1alpha1.Container
if err := w.k8s.Get(ctx, key, &latest); err != nil {
return err
}
patch := client.MergeFromWithOptions(latest.DeepCopy(), client.MergeFromWithOptimisticLock{})
latest.Status.LastAction = lastAction
return w.k8s.Status().Patch(ctx, &latest, patch)
})
}
// removeActionAnnotation clears the AnnotationAction annotation only if its
// value still matches the action we just processed. A concurrent writer may
// have replaced the annotation with a fresh action between dispatch and
// cleanup; in that case the new value must survive so the next reconcile
// picks it up.
func (w *dockerWatcher) removeActionAnnotation(ctx context.Context, id, observed string) error {
key := client.ObjectKey{Name: id, Namespace: w.apiNamespace}
return retry.RetryOnConflict(retry.DefaultRetry, func() error {
var latest containersv1alpha1.Container
if err := w.k8s.Get(ctx, key, &latest); err != nil {
return err
}
removeActionAnnotation reads the latest object, checks that the annotation still matches the processed value (lines 460–462), then sends a merge patch to remove it. client.MergeFrom does not include resourceVersion in the patch, so the API server applies it unconditionally. If a concurrent writer replaces the annotation with a new action between the Get and the Patch, the merge patch removes the new value and the action is silently lost.
The compare-and-swap check on lines 460–462 is correct in intent but does not close the TOCTOU window — it guards against changes visible in the cache at Get time, not changes committed to etcd after the Get.
The same pattern appears in patchContainerLastAction (line 442), though the consequence there is a lost status update rather than a lost action.
// failure instead of a silent success.
func (w *dockerWatcher) containerRunState(ctx context.Context, id string) (running, paused bool, err error) {
inspect, err := w.cli.ContainerInspect(ctx, id, dockerclient.ContainerInspectOptions{})
if err != nil {
return false, false, err
}
return inspect.Container.State.Running, inspect.Container.State.Paused, nil
}
// patchContainerLastAction writes status.lastAction with retry-on-conflict.
// The main engine sync writes to the status subresource on every reconcile
Fix: use client.MergeFromWithOptions(latest.DeepCopy(), client.MergeFromWithOptimisticLock{}) to include resourceVersion in the patch. The existing RetryOnConflict loop already handles the resulting 409 correctly.
if !ok {
return nil
}
log := logf.FromContext(ctx).WithName("docker-watcher")
action := containersv1alpha1.ContainerAction(raw)
requestedAt := metav1.Now()
// The webhook rejects invalid action values, but one written while the
// webhook is offline can still reach storage. Drop such values here;
// otherwise the CRD enum rejects the status.lastAction write, the
// annotation stays in place, and every reconcile retries forever.
if !action.IsValid() {
log.Info("Dropping invalid container action annotation", "id", c.Name, "action", raw)
return w.removeActionAnnotation(ctx, c.Name, raw)
}
dockerErr := w.dispatchContainerAction(ctx, log, c.Name, action)
lastAction := &containersv1alpha1.ContainerLastAction{
Action: action,
RequestedAt: requestedAt,
CompletedAt: metav1.Now(),
}
if dockerErr == nil {
lastAction.State = containersv1alpha1.ContainerActionSucceeded
} else {
lastAction.State = containersv1alpha1.ContainerActionFailed
lastAction.Error = dockerErr.Error()
log.Info("Container action failed", "id", c.Name, "action", action, "error", dockerErr)
}
if err := w.patchContainerLastAction(ctx, c.Name, lastAction); err != nil {
return fmt.Errorf("failed to patch lastAction for %s: %w", c.Name, err)
}
The update webhook validates only spec immutability; annotations are free-form. A misspelled annotation value (e.g. action=staart) passes the webhook, reaches dispatchContainerAction (which returns "unknown container action"), and then patchContainerLastAction tries to write the invalid value into status.lastAction.action. The CRD enum (start|stop|pause|unpause|restart) rejects the status write, so patchContainerLastAction fails and removeActionAnnotation never runs. The bad annotation persists and every subsequent reconcile retries the same impossible action.
Fix: validate the annotation value in the update webhook, and add a defensive guard in processContainerAction that removes the annotation and logs a warning for unrecognized values instead of attempting a status write that cannot succeed.
Suggestions ¶
switch a {
case ContainerActionStart, ContainerActionStop,
ContainerActionPause, ContainerActionUnpause,
ContainerActionRestart:
return true
}
return false
}
// +kubebuilder:validation:Enum=Succeeded;Failed
// ContainerActionState describes the outcome of an action that was requested
The reconciler blocks on the Docker call and writes only Succeeded or Failed. No code path ever writes InProgress. The enum value and its CRD entry are dead code.
Fix: remove ContainerActionInProgress from the type, the CRD, and the generated deep-copy until the controller needs it. Writing InProgress before dispatch would require an extra API call per action and break the crash-recovery property (annotation-present = in-progress).
case containersv1alpha1.ContainerActionStop:
log.Info("Stopping container", "id", id)
_, err := w.cli.ContainerStop(ctx, id, dockerclient.ContainerStopOptions{})
return err
case containersv1alpha1.ContainerActionPause:
_, paused, err := w.containerRunState(ctx, id)
if err != nil {
return err
}
if paused {
return nil
}
log.Info("Pausing container", "id", id)
_, err = w.cli.ContainerPause(ctx, id, dockerclient.ContainerPauseOptions{})
return err
case containersv1alpha1.ContainerActionUnpause:
running, paused, err := w.containerRunState(ctx, id)
if err != nil {
When a user requests unpause on a stopped container, isContainerPaused returns false, dispatchContainerAction returns nil, and processContainerAction records Succeeded. By contrast, pause on a stopped container reaches Docker and correctly records Failed. The asymmetry could confuse GUI developers.
Fix: check inspect.Container.State.Running before the !paused short-circuit and return an error when the container is not running.
| `restart` | `ContainerRestart` |
A single annotation holds at most one pending action. Writing a new
value replaces the old, so a user who requests `pause` and then
`unpause` before the reconciler has run will see only the unpause. No
queue, no accumulating history.
The engine calls Docker before patching `status.lastAction` and
removing the annotation, so a crash mid-flight leaves the annotation
in place and the next reconcile replays the action. Start, stop,
pause, and unpause are idempotent against a container already in the
target state, so replay is safe. Restart has no target state to match,
so a replay after a crash runs a second restart; the double restart
is harmless.
The clause lists restart as idempotent, then the next clause names restart as the exception. The code comment at docker_watcher.go:374-381 is precise: start/stop are idempotent via Docker's 304, pause/unpause via the reconciler pre-check, and restart may double-restart (harmlessly). The user-facing doc should mirror that.
if err := w.patchContainerLastAction(ctx, c.Name, lastAction); err != nil {
return fmt.Errorf("failed to patch lastAction for %s: %w", c.Name, err)
}
if err := w.removeActionAnnotation(ctx, c.Name, raw); err != nil {
return fmt.Errorf("failed to remove action annotation for %s: %w", c.Name, err)
}
return nil
}
// dispatchContainerAction executes the Docker call for a single action. The
// caller pre-validates the action name; the default branch triggers only
// when a new ContainerAction value is added to the type but not to the switch.
//
// Pause and unpause pre-check the current Docker state and return nil when
// the container is already in the target state. Two reconcile ticks can read
// the same action annotation through the informer cache before the cache
// sees the removal that follows the first dispatch — the pre-check keeps
Fix: drop restart from the idempotent list and state that a double restart is harmless.
Design Observations ¶
Concerns ¶
- O(N) reconcile cost per child event (in-scope):
reconcileContainerActionsandprocessFinalizersboth list all resources on every reconcile. The code comment atengine_reconciler.go:186-190acknowledges this and identifies per-resource reconcilers with watch predicates as the long-term fix. This PR does not worsen the cost, but the action-annotation watch predicate (only reconcile containers with the annotation set) is a natural next step. Codex Gemini
Strengths ¶
- Crash-recovery ordering (in-scope): dispatching the Docker call before patching status or removing the annotation ensures that a crash mid-flight leaves the annotation in place for replay. Start, stop, pause, and unpause are idempotent against a container already in the target state, so replay is safe. Gemini
- SSA simplification (in-scope): collapsing the two-field-manager apply in
sync_containers.goafter emptyingContainerSpecremoves a subtle source of pruning bugs and cuts 30 lines of code. Codex - Atomic annotation replacement (in-scope): the single-annotation model means a new value atomically replaces any pending action — no queue, no accumulating history, and a GUI user who changes their mind before the reconciler runs sees only the final action. Codex
lastActionis deliberately written via non-SSA merge patch (in-scope):patchContainerLastActionusesclient.MergeFromrather than an SSA apply-config, soengine-controllernever claimsstatus.lastActionas an SSA-owned field. This matters becauseapplyContainerinsync_containers.goruns on every Docker event and applies a status config that omitslastAction; ifengine-controllerhad claimed the field under SSA, the subsequent apply would release-and-prune it. The asymmetric write pattern is load-bearing, not incidental. Claude
Testing Assessment ¶
The BATS tests cover the five actions (start, stop, pause, unpause, restart), the failure path (pause on a stopped container), and verify lastAction outcomes. Coverage gaps:
- No test covers an invalid annotation value (I2). An invalid-action test would confirm the webhook rejects it and the reconciler does not wedge.
- No test covers
unpauseon a stopped container (S2), which would expose the false-success asymmetry. - The duplicate-informer-delivery scenario described in
docker_watcher.go:374-381is comment-verified only; deterministic BATS coverage would require timing control that is impractical.
Documentation Assessment ¶
docs/design/api_containers.md clearly explains the rationale for the annotation model, the crash-recovery trade-off for restart, and the lastAction contract. The documentation mentions only Succeeded and Failed as terminal states, consistent with the implementation — but the CRD schema still exposes InProgress (S1).
Acknowledged Limitations ¶
- Double-restart on crash recovery: the PR description documents that
restartmay run twice on crash recovery. This is an accepted idempotency gap. - No CLI support for actions: the PR comment confirms actions exist solely for the GUI; CLI users should use
docker start/stop/etc.directly.
Agent Performance Retro ¶
Claude ¶
Claude's initial run produced no output (cause unknown); a rerun succeeded. It missed both important issues (I1, I2) but contributed the review's most architecturally grounded observation: the non-SSA merge-patch shape of patchContainerLastAction is what keeps applyContainer's status apply from pruning lastAction. Codex and Gemini both noted the simplification in sync_containers.go but neither traced why the new write had to be a merge patch rather than a status apply. Claude also raised a precise documentation nit (S3, the self-contradicting restart sentence) that the other two missed.
Codex ¶
Codex found the most impactful issue in the review: the invalid-annotation wedge (I2), which no other agent raised. Its analysis traced the full path from webhook gap through CRD enum rejection to infinite retry, with a concrete fix. It correctly identified InProgress as dead code (S1). It did not flag the optimistic-concurrency gap in removeActionAnnotation (I1).
Gemini ¶
Gemini identified the annotation-removal race condition (I1, filed as critical) and the unpause-on-stopped asymmetry (S2, filed as important). Both are real issues. Gemini's C1 severity was too high — the TOCTOU window is narrow and the intended caller (GUI) serializes actions naturally — so it was downgraded to important. Gemini's I1 was downgraded to suggestion for the same practical-impact reason. Gemini also caught S1 (InProgress dead code), matching Codex.
Summary ¶
| Claude Opus 4.6 | Codex GPT 5.4 | Gemini 3.1 Pro | |
|---|---|---|---|
| Duration | 14m 04s | 5m 51s | 3m 24s |
| Findings | 2S | 1I 1S | 1I 1S |
| Tool calls | 36 (Read 17, Grep 14, Bash 4) | 41 (shell 40, stdin 1) | 3 (readfile 2, runshell_command 1) |
| Design observations | 3 | 2 | 3 |
| False positives | 0 | 0 | 0 |
| Unique insights | 2 | 1 | 1 |
| Files reviewed | 15 | 15 | 15 |
| Coverage misses | 0 | 0 | 0 |
| Totals | 2S | 1I 1S | 1I 1S |
| Downgraded | 0 | 0 | 2 (I→S) |
| Dropped | 0 | 0 | 0 |
Reconciliation: Gemini C1 (annotation removal race): critical → important I1, because the TOCTOU window is narrow and the GUI serializes actions naturally. Gemini I1 (unpause on stopped): important → suggestion S2, because the scenario is unusual and does not cause data loss or crashes.
Review Process Notes ¶
No suggestions.
Resolution ¶
| Addressed | 2026-04-16 |
| Commit | 6251b90 engine: replace Container.spec.state with action annotations (fixup-squashed into the original) |
| # | Finding | Action |
|---|---|---|
| 1 | Important #1: Annotation removal patch lacks optimistic concurrency | Fixed |
| 2 | Important #2: Invalid action annotation wedges reconciliation | Fixed |
| 3 | Suggestion #1: InProgress action state is defined but never written | Fixed |
| 4 | Suggestion #2: unpause on a stopped container silently records Succeeded | Fixed |
| 5 | Suggestion #3: Restart-idempotency sentence contradicts itself | Fixed |
| 6 | Testing Gap #1: No test for invalid annotation value | Test written |
| 7 | Testing Gap #2: No test for unpause on a stopped container | Test written |
| 8 | Testing Gap #3: Duplicate-informer-delivery scenario is comment-verified only | Skipped |