# Deep Review: 20260416-120302-pr-316

| | |
|---|---|
| **Date** | 2026-04-16 12:44 |
| **Repo** | [rancher-sandbox/rancher-desktop-daemon](https://github.com/rancher-sandbox/rancher-desktop-daemon) |
| **Round** | 2 |
| **Author** | [@jandubois](https://github.com/jandubois) |
| **PR** | [#316](https://github.com/rancher-sandbox/rancher-desktop-daemon/pull/316) — engine: replace Container.spec.state with action annotations |
| **Commits** | `6251b90` engine: replace Container.spec.state with action annotations |
| **Reviewers** | Claude Opus 4.6, Codex GPT 5.4, Gemini 3.1 Pro |
| **Verdict** | **Merge with fixes** — design is sound; I1 and I2 are genuine gaps worth addressing before merge. |
| **Wall-clock time** | `44 min 54 s` |

## Executive Summary

The PR replaces `Container.spec.state` (a level-triggered desired-state field)
with a one-shot `containers.rancherdesktop.io/action` annotation and a new
`status.lastAction` outcome record. The motivation — Docker's restart policy
and direct `docker start/stop` are concurrent writers that a level-trigger
mirror cannot sanely own — is correct, and the resulting shape (empty spec,
edge-triggered intent, persistent outcome) is a clean fit. The SSA
simplification (dropping the `engine-controller-finalizer` field owner) falls
out naturally once the spec is empty, and the pause/unpause pre-checks address
the documented informer-cache replay hazard.

Three important gaps remain on the edges: the admission webhook validates the
action annotation only on `Update`, leaving `Create` unguarded;
`patchContainerLastAction` and `removeActionAnnotation` both surface `NotFound`
as an error on self-healing races; and the Get-Patch sequence between
`patchContainerLastAction` and `removeActionAnnotation` can race the informer
cache, producing 409 loops that replay the Docker action. None is critical,
but all three are worth fixing before merge.

---

## Critical Issues

None.

## Important Issues

I1. **Action-annotation validation absent on `Create`** — `pkg/controllers/containers/container/container_controller.go:82-92`, `pkg/controllers/containers/container/container_webhooks.go:24-26,41-45` [Claude Opus 4.6, Codex GPT 5.4]

```go
// container_controller.go
Operations: []admissionregistrationv1.OperationType{
    admissionregistrationv1.Update,
},
Validator: &immutableValidator{},
```

```go
// container_webhooks.go
func (c *immutableValidator) ValidateCreate(context.Context, *v1alpha1.Container) (warnings ctrlwebhookadmission.Warnings, err error) {
    return nil, errors.New("webhook does not implement create")
}
// ValidateUpdate
if raw, ok := newContainer.Annotations[v1alpha1.AnnotationAction]; ok {
    if !v1alpha1.ContainerAction(raw).IsValid() {
        return nil, fmt.Errorf("invalid %s annotation %q", v1alpha1.AnnotationAction, raw)
    }
}
```

The webhook registers only for `Update`, so `ValidateUpdate` is the only
enforcer of the action enum. A client that POSTs a hand-written `Container`
with `metadata.annotations["containers.rancherdesktop.io/action"]=bogus`
passes admission because `ValidateCreate` is never invoked. `.spec` is now
optional (`container_types.go:233-238`), and the CRD no longer requires
`spec` at the top level (`crd.yaml:270-275`, `crd.yaml:493-500`), so the
barrier to a hand-written `Container` is lower than before the PR. The
reconciler's `!action.IsValid()` fallback at `docker_watcher.go:350`
catches invalid values, but a valid action on a hand-written mirror would
be dispatched against whatever container ID was put in `metadata.name`,
bypassing the documented `ContainerCreateRequest` flow. The new BATS
coverage at `engine-docker.bats:366-387` only exercises the Update-reject
path.

Fix: register the validator for `Create` as well and implement
`ValidateCreate` symmetrically with `ValidateUpdate`. If direct
Container creates should be rejected outright, keep the current
`errors.New("webhook does not implement create")` body and gate the
engine reconciler's own apply path (it uses `client.ForceOwnership`
with `engine-controller` as field owner, which bypasses admission for
controller-initiated applies only if configured that way). Add a BATS
test that `rdd ctl apply -f` of a hand-written Container with a bogus
action annotation fails admission on create.

I2. **`patchContainerLastAction` and `removeActionAnnotation` surface `NotFound` as an error** — `pkg/controllers/app/engine/controllers/docker_watcher.go:449-460,467-482` [Claude Opus 4.6]

```go
func (w *dockerWatcher) patchContainerLastAction(ctx context.Context, id string, lastAction *containersv1alpha1.ContainerLastAction) error {
    key := client.ObjectKey{Name: id, Namespace: w.apiNamespace}
    return retry.RetryOnConflict(retry.DefaultRetry, func() error {
        var latest containersv1alpha1.Container
        if err := w.k8s.Get(ctx, key, &latest); err != nil {
            return err
        }
        ...
    })
}
```

If the Container mirror is deleted (by `cleanupMirrorResources`, a
concurrent `docker rm` destroy event, or a user delete) between
`dispatchContainerAction` and these helpers, the `Get` returns
`NotFound`. `retry.RetryOnConflict` does not retry on `NotFound`, so the
error propagates up through `processContainerAction` (line 371) into the
joined error at `reconcileContainerActions:374`. The controller then
requeues and logs "failed to patch lastAction: NotFound" for a container
that no longer exists. The same log message fires for genuine
conflict-exhaustion failures, so the noise misdirects a debugger.

Fix: treat `NotFound` as success on both the `Get` and the `Patch`, in
both `patchContainerLastAction` and `removeActionAnnotation`.

```diff
 func (w *dockerWatcher) patchContainerLastAction(ctx context.Context, id string, lastAction *containersv1alpha1.ContainerLastAction) error {
     key := client.ObjectKey{Name: id, Namespace: w.apiNamespace}
     return retry.RetryOnConflict(retry.DefaultRetry, func() error {
         var latest containersv1alpha1.Container
         if err := w.k8s.Get(ctx, key, &latest); err != nil {
-            return err
+            return client.IgnoreNotFound(err)
         }
         patch := client.MergeFromWithOptions(latest.DeepCopy(), client.MergeFromWithOptimisticLock{})
         latest.Status.LastAction = lastAction
-        return w.k8s.Status().Patch(ctx, &latest, patch)
+        return client.IgnoreNotFound(w.k8s.Status().Patch(ctx, &latest, patch))
     })
 }
```

I3. **Status-patch-then-annotation-patch races the informer cache** — `pkg/controllers/app/engine/controllers/docker_watcher.go:370-375,449-482` [Gemini 3.1 Pro]

```go
if err := w.patchContainerLastAction(ctx, c.Name, lastAction); err != nil {
    return fmt.Errorf("failed to patch lastAction for %s: %w", c.Name, err)
}
if err := w.removeActionAnnotation(ctx, c.Name, raw); err != nil {
    return fmt.Errorf("failed to remove action annotation for %s: %w", c.Name, err)
}
```

`patchContainerLastAction` writes the status subresource, which bumps
the object's `resourceVersion`. `removeActionAnnotation` then re-Gets
via the cached client and patches with
`client.MergeFromWithOptimisticLock{}`, which attaches the Got
`resourceVersion` to the payload. Between the two writes, the informer
has not yet received the watch event for the status bump, so the Get
returns the stale pre-status version; the subsequent patch then 409s.
`retry.RetryOnConflict` re-runs the closure but re-reads the same cache,
so successive attempts keep seeing stale data until the informer catches
up. If the cache does not sync within `retry.DefaultRetry`'s ~310ms
budget, `removeActionAnnotation` returns the error,
`processContainerAction` propagates it, and the next reconcile replays
the whole Docker action because the annotation is still present. For
`start`/`stop`/`pause`/`unpause` the replay is a no-op (Docker returns
304 or the pre-check skips); for `restart` the replay is a real
double-restart.

Fix: thread the fresh `Container` object through from
`patchContainerLastAction` into `removeActionAnnotation`. The status
`Patch` updates the passed-in object in place with the new
`resourceVersion`, so `removeActionAnnotation` can skip its initial Get
on the happy path and patch against the known-fresh version.

```diff
-func (w *dockerWatcher) patchContainerLastAction(ctx context.Context, id string, lastAction *containersv1alpha1.ContainerLastAction) error {
+func (w *dockerWatcher) patchContainerLastAction(ctx context.Context, id string, lastAction *containersv1alpha1.ContainerLastAction) (*containersv1alpha1.Container, error) {
     key := client.ObjectKey{Name: id, Namespace: w.apiNamespace}
+    var latest containersv1alpha1.Container
-    return retry.RetryOnConflict(retry.DefaultRetry, func() error {
-        var latest containersv1alpha1.Container
+    err := retry.RetryOnConflict(retry.DefaultRetry, func() error {
         if err := w.k8s.Get(ctx, key, &latest); err != nil {
             return err
         }
         patch := client.MergeFromWithOptions(latest.DeepCopy(), client.MergeFromWithOptimisticLock{})
         latest.Status.LastAction = lastAction
         return w.k8s.Status().Patch(ctx, &latest, patch)
     })
+    return &latest, err
 }
```

Note: preserving the current `observed != current` check in
`removeActionAnnotation` is important for the concurrent-rewrite path
(I1 design strength in Claude's and Gemini's observations). Threading
the fresh object through removes the cache-race on the common path
without losing that guard; if the threaded object shows the annotation
still equal to `observed`, the Patch uses its resourceVersion.

## Suggestions

S1. **Same-value action rewrites are indistinguishable from cleanup** — `pkg/controllers/app/engine/controllers/docker_watcher.go:462-480` [Codex GPT 5.4, Claude Opus 4.6]

```go
current, present := latest.Annotations[containersv1alpha1.AnnotationAction]
if !present || current != observed {
    return nil
}
```

`removeActionAnnotation` distinguishes concurrent writes by comparing the
annotation value to the value it dispatched. A user who writes `restart`
twice in quick succession (e.g., a deliberate re-trigger after the first
completes) has the second write silently consumed because `current ==
observed`. For idempotent actions (`start`/`stop`/`pause`/`unpause`) the
observable state is unchanged and this is fine. For `restart`, re-hitting
the "Restart" button does not produce a second restart. The design is
intentionally "latest wins, no queue"
(`api_containers.md:237-240`), so this is a documentation opportunity
rather than a code bug.

Fix: add a sentence to `api_containers.md` under "Change container state"
noting that a repeated same-value action is collapsed — the GUI needs to
rewrite to a different intermediate value (or let the first complete)
to force a second dispatch. Alternatively, encode a request nonce in the
annotation payload (e.g., `restart/abc123`) and compare by id instead of
by action name.

S2. **`LastAction.RequestedAt` name doesn't match what it records** — `pkg/apis/containers/v1alpha1/container_types.go:88-91`, `pkg/controllers/app/engine/controllers/docker_watcher.go:344` [Claude Opus 4.6]

```go
// RequestedAt is when the action was observed by the reconciler.
//
// +optional
RequestedAt metav1.Time `json:"requestedAt,omitempty,omitzero"`
```

```go
requestedAt := metav1.Now()
```

The JSON key reads as "when the UI requested the action"; the value is
actually "when the reconciler first observed the annotation". Under load
the two can differ by seconds. Since the CRD has not shipped, renaming
is cheap.

Fix: rename to `observedAt` (field, JSON tag, apply configuration, CRD
enum, doc examples), or keep the name and sharpen the docstring: "the
time the reconciler first observed the action annotation; backlog may
delay it relative to the user's write time."

S3. **The `!action.IsValid()` reconciler fallback has no test** — `pkg/controllers/app/engine/controllers/docker_watcher.go:346-353` [Claude Opus 4.6]

```go
// The webhook rejects invalid action values, but one written while the
// webhook is offline can still reach storage. Drop such values here;
// otherwise the CRD enum rejects the status.lastAction write, the
// annotation stays in place, and every reconcile retries forever.
if !action.IsValid() {
    log.Info("Dropping invalid container action annotation", "id", c.Name, "action", raw)
    return w.removeActionAnnotation(ctx, c.Name, raw)
}
```

The comment describes a real failure mode that BATS cannot easily exercise
(webhook offline). A unit test against a fake client that constructs a
Container with `annotations[AnnotationAction]="bogus"`, calls
`processContainerAction`, and asserts the annotation was removed and
`lastAction` was not written would close the gap without an integration
tail. If I1 is applied, the exposure shrinks but doesn't disappear — the
comment specifically targets the webhook-offline case.

S4. **Restart replay after crash is more disruptive than "harmless"** — `docs/design/api_containers.md:242-248`, `pkg/controllers/app/engine/controllers/docker_watcher.go:331-335,426-429` [Claude Opus 4.6]

```
# docker_watcher.go:333-335
# Start, stop, pause, unpause, and restart are all
# idempotent against a container already in the target state, so replay
# is safe; restart is the exception and may restart the container twice.
```

```
# api_containers.md:246-248
# Restart has no target state to match,
# so a replay after a crash runs a second restart; the double restart
# is harmless.
```

"Harmless" understates it. `ContainerRestart` sends SIGTERM and waits the
container's grace period; a double-restart during graceful shutdown can
interrupt a long-running workload (database drain, mid-request HTTP
handler) twice in quick succession.

Fix: reword as "a second restart interrupts whatever the container was
doing; the controller cannot distinguish a crash-replay from a
deliberate second request."

S5. **`AnnotationAction` is a package-level `var`** — `pkg/apis/containers/v1alpha1/container_types.go:33` [Claude Opus 4.6]

```go
var AnnotationAction = SchemeGroupVersion.Group + "/action"
```

`SchemeGroupVersion.Group` is a struct field, so this expression cannot be
a Go `const`. Any package can reassign `containersv1alpha1.AnnotationAction`
at runtime. Kubernetes uses `const` for similar annotation keys and
spells the group out literally
(`kubectl.kubernetes.io/last-applied-configuration`). Matching that
idiom trades one line of duplicated group name for compile-time
immutability.

Fix:

```diff
-var AnnotationAction = SchemeGroupVersion.Group + "/action"
+const AnnotationAction = "containers.rancherdesktop.io/action"
```

S6. **Two `rdd ctl get` calls per `assert_last_action`** — `bats/tests/32-app-controllers/engine-docker.bats:266-274` [Claude Opus 4.6]

```bash
assert_last_action() {
    local cid=$1 action=$2 state=$3
    run -0 rdd ctl get container "${cid}" --namespace="${RDD_NAMESPACE}" \
        -o jsonpath='{.status.lastAction.action}'
    assert_output "${action}"
    run -0 rdd ctl get container "${cid}" --namespace="${RDD_NAMESPACE}" \
        -o jsonpath='{.status.lastAction.state}'
    assert_output "${state}"
}
```

Two roundtrips and two snapshots of the object. In practice the tests
don't race their own assertions, but combining both reads into one
jsonpath halves the API traffic and asserts both fields against the same
object version:

```diff
 assert_last_action() {
     local cid=$1 action=$2 state=$3
     run -0 rdd ctl get container "${cid}" --namespace="${RDD_NAMESPACE}" \
-        -o jsonpath='{.status.lastAction.action}'
-    assert_output "${action}"
-    run -0 rdd ctl get container "${cid}" --namespace="${RDD_NAMESPACE}" \
-        -o jsonpath='{.status.lastAction.state}'
-    assert_output "${state}"
+        -o jsonpath='{.status.lastAction.action}={.status.lastAction.state}'
+    assert_output "${action}=${state}"
 }
```

---

## Design Observations

### Strengths

- **Implicit "in-progress" via annotation presence (in-scope)** [Gemini 3.1 Pro] — Using the presence of the annotation as the implied "in-progress" indicator avoids a third, latency-inducing API write (e.g., `state=InProgress`) before dispatching the Docker call. Clean signal for the GUI.
- **Concurrent-rewrite safety via value-match (in-scope)** [Claude Opus 4.6, Gemini 3.1 Pro] — `removeActionAnnotation`'s `current != observed` check correctly preserves a concurrent rewrite to a different action (e.g., a user switching `pause` to `stop` mid-flight). This is a small amount of code doing a lot of work.
- **SSA simplification follows naturally (in-scope)** [Codex GPT 5.4, Claude Opus 4.6] — Collapsing the two SSA field managers (`engine-controller` and `engine-controller-finalizer`) into one is a direct consequence of spec becoming empty. The old dance existed only because `engine-controller` owned `spec.state`. Nothing to own, nothing to dance around.
- **Asymmetric pre-check is correct (in-scope)** [Claude Opus 4.6, Gemini 3.1 Pro] — Pre-checking Docker state for `pause`/`unpause` but not `start`/`stop` is intentionally asymmetric. Docker returns 304 on redundant start/stop but silently succeeds on unpause against a non-paused container; the pre-check turns that silent success into a documented failure. The explanatory comment at `docker_watcher.go:383-390` makes the asymmetry legible.
- **Status-only mirror is the right abstraction (in-scope)** [Codex GPT 5.4] — Abandoning the level-triggered `spec.state` in favor of a status-only mirror with edge-triggered intent is a better fit for a resource whose authoritative state lives outside K8s.

### Concerns

- **Reconciler full-lists the container list on every tick (future)** [Claude Opus 4.6] — `reconcileContainerActions` Lists all containers in the namespace on every reconcile, and every Container/Image/Volume watch event triggers a reconcile. With N containers and M events/second, cost is O(N·M). The comment at `engine_reconciler.go:186-195` already captures this as the long-term plan (per-resource reconcilers with watch predicates); this PR's new flow doubles the per-reconcile list count (`reconcileContainerActions` + `processFinalizers`). Noted but explicitly deferred.
- **One-shot annotation drops same-value re-requests (future)** [Codex GPT 5.4, Claude Opus 4.6] — See S1 for the same observation framed as a documentation suggestion. For idempotent actions this is fine; for `restart` it means a deliberate re-trigger is silently collapsed. Design is intentional ("latest wins, no queue"); a doc note would close the loop.

## Testing Assessment

The five new action tests (`engine-docker.bats:276-408`) cover happy paths
(start, stop, pause+unpause toggle, restart) plus two failure paths
(pause-on-stopped, unpause-on-stopped) and webhook-reject for invalid
action values. The pre-check logic on pause/unpause is exercised. Gaps,
ranked by risk:

1. **`!action.IsValid()` reconciler fallback** (S3). No way to hit it via
   BATS unless the webhook is disabled. A unit test against a fake client
   would close this.
2. **Webhook rejection on Create** (conditional on I1 being applied).
3. **Concurrent rewrite during dispatch**. A BATS test that annotates
   `action=stop`, waits briefly, annotates `action=start`, and asserts
   the final state is `running` would exercise the `observed != current`
   skip path.
4. **Action while mirror is mid-deletion**. Exercises the
   `DeletionTimestamp != nil` skip in `reconcileContainerActions:367` and
   the `NotFound` handling I2 targets.
5. **Status-patch-then-annotation-patch race (I3)**. Hard to trigger
   deterministically in BATS; a unit test with a lagging fake client
   could force the stale-cache window.

## Documentation Assessment

`docs/design/api_containers.md` is rewritten cleanly around the action
model with both Succeeded and Failed `lastAction` examples. Specific
gaps:

- The doc at `api_containers.md:155-159` shows a `Succeeded` `lastAction`
  without an `error` field; the Failed example at 255-262 shows one. A
  one-liner — "`error` is set only when `state: Failed`" — would remove
  the ambiguity for skimmers.
- `api_containers.md:246-248` "harmless double-restart" wording is
  addressed by S4.
- Missing note on `lastAction` persistence semantics: it survives
  observable state changes (`docker restart` of the underlying container,
  `docker stop`/`start`, etc.) because sync doesn't declare ownership of
  `lastAction`. Worth one sentence so GUI implementers don't assume
  `lastAction` is cleared on state transitions.

## Commit Structure

One commit, `6251b90 engine: replace Container.spec.state with action
annotations`. The message accurately describes the change, motivates it,
and covers the edge cases (crash recovery, duplicate-dispatch pre-check,
webhook-offline drop). Scope is a single conceptual unit — API type
change, reconciler change, webhook change, CRD regeneration, design doc,
tests. Stacked on #309 and a prerequisite for #315 per the PR description.

## Acknowledged Limitations

- **One-annotation, no queue** — PR description and
  `api_containers.md:237-240`. A new write replaces the pending action;
  same-value re-requests collapse. This is intentional and is the reason
  S1 asks for a doc note rather than code changes.
- **Restart double-execution on crash** —
  `docker_watcher.go:333-335` and `api_containers.md:246-248`. Addressed
  by S4 (wording) rather than code (pre-check with StartedAt/CompletedAt
  window added complexity for a rare scenario).
- **Per-reconcile list cost** — `engine_reconciler.go:186-195`. Long-term
  plan is per-resource reconcilers with watch predicates; this PR keeps
  the same tradeoff.
- **No CLI support for actions** — author comment on the PR. The action
  API is GUI-facing by design; CLI users are directed to
  `docker`/`nerdctl` directly. Documented in `api_containers.md:264-266`.

---

## Agent Performance Retro

### [Claude]

Caught both I1 (webhook on Create) and I2 (NotFound propagation) plus a
dense suggestion list focused on polish (naming, const vs var,
test-helper consolidation, doc wording). The I2 analysis is
Claude-only: the other two agents looked at the same helpers but did
not notice the `NotFound` path. S4's "restart double-execution is
more disruptive than harmless" is also uniquely Claude's — it pushed
back on the author's own phrasing. Balanced output: nothing felt
inflated, and the suggestions are all actionable.

### [Codex]

Converged with Claude on I1 — same root cause (validator registered
for Update only), different framing. Codex additionally raised S1
(same-value rewrites collapse), which Claude surfaced as a design
concern rather than a numbered finding; framing as a numbered
suggestion makes it easier to act on. Coverage was clean, and the
pass had no false positives. Shorter output than Claude, but what's
there is correct.

### [Gemini]

Initial launch hit a `429 MODEL_CAPACITY_EXHAUSTED` retry loop for
~10 minutes on the preview model; killed and relaunched after the user
updated `gemini-cli`. The second run produced a single, sharp finding:
I3 (status-patch-then-annotation-patch cache race). This was missed by
both Claude and Codex despite them both reading the same helpers. The
fix Gemini proposed — thread the fresh object through — is the right
shape, though it needs a small refinement to preserve the
concurrent-rewrite guard on the `current != observed` path. Gemini
marked `container_webhooks.go` as "Reviewed, no issues", missing I1;
a coverage gap.

### Summary

| | Claude Opus 4.6 | Codex GPT 5.4 | Gemini 3.1 Pro |
|---|---|---|---|
| Duration | 13m 58s | 5m 06s | 6m 52s |
| Findings | 2I 5S | 1I 1S | 1I |
| Tool calls | 49 (Grep 21, Read 14, Bash 9) | 43 (shell 42, stdin 1) | 28 (run_shell_command 16, grep_search 6, write_file 4) |
| Design observations | 6 | 2 | 3 |
| False positives | 0 | 0 | 0 |
| Unique insights | 3 | 1 | 1 |
| Files reviewed | 15 | 15 | 15 |
| Coverage misses | 0 | 0 | 1 |
| **Totals** | **2I 5S** | **1I 1S** | **1I** |
| Downgraded | 0 | 0 | 0 |
| Dropped | 0 | 0 | 0 |


Overall, each agent contributed exactly one independent finding that
the others missed (Claude: I2; Codex: S1; Gemini: I3). Claude was the
highest-volume and most polished; Codex was the most concise with the
cleanest coverage; Gemini's single catch (I3) was the subtlest of the
three and would likely have been missed entirely without it.

## Review Process Notes

None for this review. The existing prompts did their job; the
findings that emerged were all naturally derived from the stated
dimensions (admission symmetry, subresource-patch races, error
propagation, naming). No skill-level generalisation suggests itself
without stretching into specifics of this PR, and `deep-review-context.md`
was not contradicted by anything under review.

---

## Resolution

| | |
|---|---|
| **Addressed** | 2026-04-16 |
| **Commit** | `397d40a engine: replace Container.spec.state with action annotations` (fixup-squashed into the original) |

| # | Finding | Action |
|---|---------|--------|
| 1 | Important #1: Action-annotation validation absent on Create | Fixed |
| 2 | Important #2: patchContainerLastAction/removeActionAnnotation surface NotFound as error | Fixed |
| 3 | Important #3: Status-patch-then-annotation-patch races the informer cache | Fixed |
| 4 | Suggestion #1: Same-value action rewrites indistinguishable from cleanup | Skipped |
| 5 | Suggestion #2: LastAction.RequestedAt name doesn't match what it records | Fixed (renamed to observedAt) |
| 6 | Suggestion #3: !action.IsValid() reconciler fallback has no test | Skipped |
| 7 | Suggestion #4: Restart replay after crash more disruptive than "harmless" | Fixed (reworded doc and comment) |
| 8 | Suggestion #5: AnnotationAction is a package-level var | Fixed (const with literal group) |
| 9 | Suggestion #6: Two rdd ctl get calls per assert_last_action | Fixed (single jsonpath) |
| 10 | Testing Gap #1: No test for !action.IsValid() fallback | Skipped |
| 11 | Testing Gap #2: Webhook rejection on Create untested | Test written |
| 12 | Testing Gap #3: Concurrent rewrite during dispatch untested | Skipped |
| 13 | Testing Gap #4: Action while mirror is mid-deletion untested | Skipped |
| 14 | Testing Gap #5: Status-patch-then-annotation-patch race untested | Skipped |
| 15 | Documentation Gap #1: error field set only when state: Failed | Documentation updated |
| 16 | Documentation Gap #2: lastAction persistence semantics undocumented | Documentation updated |

---

## Appendix: Original Reviews

### [Claude] — Pass 1

Let me check some more details about webhook setup and how SSA interacts with it.

Now I have enough context to write the review.

# Review of PR #316 — engine: replace Container.spec.state with action annotations

| | |
|---|---|
| **Verdict** | Approve with minor fixes |
| **Scope** | 15 files, +530 / −261 |

## Executive Summary

The PR replaces the level-triggered `Container.spec.state` field with a one-shot `containers.rancherdesktop.io/action` annotation, with outcome reflected in `status.lastAction`. The design is correct: it sidesteps the Docker-vs-reconciler write-war that motivated the change, the pause/unpause pre-checks address the documented informer-cache replay hazard, and the SSA simplification (dropping the two-field-owner dance) is a direct consequence of spec being empty.

The main gaps are on the edges: the admission webhook validates the action annotation only on `Update`, leaving `Create` as an unguarded path; `patchContainerLastAction` surfaces `NotFound` as an error on a self-healing race; and a couple of smaller naming / documentation mismatches.

## Findings

### Critical Issues

None.

### Important Issues

**I1. Action-annotation validation absent on `Create` — `pkg/controllers/containers/container/container_controller.go:82-92`, `pkg/controllers/containers/container/container_webhooks.go:24-26,41-45` (important, gap)**

```go
// container_controller.go
Operations: []admissionregistrationv1.OperationType{
    admissionregistrationv1.Update,
},
Validator: &immutableValidator{},
```

```go
// container_webhooks.go
func (c *immutableValidator) ValidateCreate(context.Context, *v1alpha1.Container) (warnings ctrlwebhookadmission.Warnings, err error) {
    return nil, errors.New("webhook does not implement create")
}
...
// ValidateUpdate
if raw, ok := newContainer.Annotations[v1alpha1.AnnotationAction]; ok {
    if !v1alpha1.ContainerAction(raw).IsValid() {
        return nil, fmt.Errorf("invalid %s annotation %q", v1alpha1.AnnotationAction, raw)
    }
}
```

The webhook registers only for `Update` (line 89). A client that POSTs a new `Container` with `metadata.annotations["containers.rancherdesktop.io/action"]=bogus` passes admission — `ValidateCreate` is never invoked because the webhook isn't subscribed to `Create`. The reconciler's `!action.IsValid()` fallback at `docker_watcher.go:350` catches this later, but the BATS suite (`engine-docker.bats:366-387`) only exercises the `Update`-rejection path. The review context explicitly calls out the symmetry rule: validations should apply on every operation the field is mutable on. Annotations are mutable on `Create`.

The current `ValidateCreate` returns `errors.New("webhook does not implement create")`, so simply adding `admissionregistrationv1.Create` to the `Operations` list would reject every engine-initiated Container create. The fix is two coordinated changes.

Fix:

```diff
 // container_controller.go
 Operations: []admissionregistrationv1.OperationType{
+    admissionregistrationv1.Create,
     admissionregistrationv1.Update,
 },
```

```diff
 // container_webhooks.go
-func (c *immutableValidator) ValidateCreate(context.Context, *v1alpha1.Container) (warnings ctrlwebhookadmission.Warnings, err error) {
-    return nil, errors.New("webhook does not implement create")
+func (c *immutableValidator) ValidateCreate(_ context.Context, newContainer *v1alpha1.Container) (warnings ctrlwebhookadmission.Warnings, err error) {
+    if raw, ok := newContainer.Annotations[v1alpha1.AnnotationAction]; ok {
+        if !v1alpha1.ContainerAction(raw).IsValid() {
+            return nil, fmt.Errorf("invalid %s annotation %q", v1alpha1.AnnotationAction, raw)
+        }
+    }
+    return nil, nil
 }
```

Follow with a BATS test that submits a fresh Container via `rdd ctl apply` with a bogus action annotation and asserts admission rejection.

**I2. `patchContainerLastAction` returns `NotFound` as an error — `pkg/controllers/app/engine/controllers/docker_watcher.go:449-460` (important, regression)**

```go
func (w *dockerWatcher) patchContainerLastAction(ctx context.Context, id string, lastAction *containersv1alpha1.ContainerLastAction) error {
    key := client.ObjectKey{Name: id, Namespace: w.apiNamespace}
    return retry.RetryOnConflict(retry.DefaultRetry, func() error {
        var latest containersv1alpha1.Container
        if err := w.k8s.Get(ctx, key, &latest); err != nil {
            return err
        }
        ...
    })
}
```

If the Container mirror is deleted (by `cleanupMirrorResources`, a concurrent `docker rm` destroy event, or a user delete) between `dispatchContainerAction` and `patchContainerLastAction`, the `Get` at line 453 returns `NotFound`. `retry.RetryOnConflict` does not retry on `NotFound`, so the closure's `NotFound` propagates up through `processContainerAction` (line 371) and into `reconcileContainerActions`' joined error at line 374.

The controller then requeues and logs a spurious "failed to patch lastAction: NotFound" error for a container that no longer exists. The next reconcile self-heals because the deleted container isn't in the list any more, but the log noise misdirects a debugger — the same log message fires for genuine 409-exhaustion failures, which are real bugs.

Fix: treat `NotFound` as a success at both steps. `removeActionAnnotation` (line 467-482) has the same exposure and wants the same fix.

```diff
 func (w *dockerWatcher) patchContainerLastAction(ctx context.Context, id string, lastAction *containersv1alpha1.ContainerLastAction) error {
     key := client.ObjectKey{Name: id, Namespace: w.apiNamespace}
     return retry.RetryOnConflict(retry.DefaultRetry, func() error {
         var latest containersv1alpha1.Container
         if err := w.k8s.Get(ctx, key, &latest); err != nil {
-            return err
+            return client.IgnoreNotFound(err)
         }
         patch := client.MergeFromWithOptions(latest.DeepCopy(), client.MergeFromWithOptimisticLock{})
         latest.Status.LastAction = lastAction
-        return w.k8s.Status().Patch(ctx, &latest, patch)
+        return client.IgnoreNotFound(w.k8s.Status().Patch(ctx, &latest, patch))
     })
 }
```

### Suggestions

**S1. `LastAction.RequestedAt` name doesn't match what it records — `pkg/apis/containers/v1alpha1/container_types.go:88-91`, `docker_watcher.go:344` (suggestion, regression)**

```go
// container_types.go
// RequestedAt is when the action was observed by the reconciler.
//
// +optional
RequestedAt metav1.Time `json:"requestedAt,omitempty,omitzero"`
```

```go
// docker_watcher.go
requestedAt := metav1.Now()
```

The field name says "requested"; the docstring says "observed by the reconciler"; the code assigns `metav1.Now()` inside `processContainerAction`, which is the reconciler's observation time, not the user's write time. A consumer looking only at the JSON key `requestedAt` would read it as "when the UI annotated the container". If the reconciler queue backs up, the two can differ by seconds or minutes. Since the CRD has not shipped, renaming is cheap.

Fix: rename to `observedAt` (field, JSON tag, `ContainerLastActionApplyConfiguration.RequestedAt`, CRD schema in `crd.yaml:404-408`, docs YAML examples in `api_containers.md:156-159,258-262`, and the variable in `docker_watcher.go:344`). Alternatively keep the name and fold the discrepancy into a clearer docstring: "RequestedAt records when the reconciler first observed the annotation; a backlog may delay it relative to the user's write."

**S2. The `!action.IsValid()` fallback has no test — `pkg/controllers/app/engine/controllers/docker_watcher.go:346-353` (suggestion, gap)**

```go
// The webhook rejects invalid action values, but one written while the
// webhook is offline can still reach storage. Drop such values here;
// otherwise the CRD enum rejects the status.lastAction write, the
// annotation stays in place, and every reconcile retries forever.
if !action.IsValid() {
    log.Info("Dropping invalid container action annotation", "id", c.Name, "action", raw)
    return w.removeActionAnnotation(ctx, c.Name, raw)
}
```

The comment describes a real failure mode (webhook offline → invalid annotation → CRD enum rejects `status.lastAction` write → the annotation is never consumed → infinite reconcile loop), but the only coverage is via the webhook-reject test at `engine-docker.bats:366-387`. A unit test is the easiest way to exercise this path: construct a Container with `annotations[AnnotationAction]="bogus"` against a fake client, call `processContainerAction`, assert the annotation was removed and `lastAction` was not written. If I1 is not applied, the exposure is larger because `Create` accepts invalid annotations unconditionally.

**S3. Restart replay after crash may surprise users more than the doc suggests — `docs/design/api_containers.md:242-248`, `pkg/controllers/app/engine/controllers/docker_watcher.go:331-335,426-429` (suggestion, gap)**

```
// docker_watcher.go:333-335
// Start, stop, pause, unpause, and restart are all
// idempotent against a container already in the target state, so replay
// is safe; restart is the exception and may restart the container twice.
```

```
# api_containers.md:246-248
# Restart has no target state to match,
# so a replay after a crash runs a second restart; the double restart
# is harmless.
```

"Harmless" understates it. `ContainerRestart` sends SIGTERM and waits the container's grace period; a double-restart during graceful shutdown can leave a long-running workload (database drain, mid-request HTTP handler) killed twice in quick succession. For GUI-driven use this is typically fine, but the wording suggests "no observable effect", which isn't accurate. Consider either:

1. Rewording as "a second restart interrupts whatever the container was doing; the controller cannot distinguish a crash-replay from a deliberate second request."
2. Pre-checking `State.Running` + `State.StartedAt` against `CompletedAt` to short-circuit if the container was freshly started within a narrow window. This adds complexity for a rare crash-replay scenario, so (1) is the lighter fix.

**S4. `AnnotationAction` is a package-level `var` — `pkg/apis/containers/v1alpha1/container_types.go:33` (suggestion, regression)**

```go
var AnnotationAction = SchemeGroupVersion.Group + "/action"
```

`SchemeGroupVersion` is a struct value, so this expression is not a constant and has to be a `var`. Any package can reassign `containersv1alpha1.AnnotationAction`. Kubernetes uses `const` for similar annotation keys and spells the group out literally (`kubectl.kubernetes.io/last-applied-configuration`). Matching that idiom trades one line of duplicated group name for compile-time immutability.

Fix:

```diff
-var AnnotationAction = SchemeGroupVersion.Group + "/action"
+const AnnotationAction = "containers.rancherdesktop.io/action"
```

**S5. Two separate `rdd ctl get` calls per `assert_last_action` — `bats/tests/32-app-controllers/engine-docker.bats:266-274` (suggestion, gap)**

```bash
assert_last_action() {
    local cid=$1 action=$2 state=$3
    run -0 rdd ctl get container "${cid}" --namespace="${RDD_NAMESPACE}" \
        -o jsonpath='{.status.lastAction.action}'
    assert_output "${action}"
    run -0 rdd ctl get container "${cid}" --namespace="${RDD_NAMESPACE}" \
        -o jsonpath='{.status.lastAction.state}'
    assert_output "${state}"
}
```

Two `rdd ctl get` calls means two roundtrips to the apiserver and two snapshots of the object. In practice the tests don't race their own assertions, but combining both reads into one jsonpath would halve the api traffic and assert both fields against the same object version:

```diff
 assert_last_action() {
     local cid=$1 action=$2 state=$3
     run -0 rdd ctl get container "${cid}" --namespace="${RDD_NAMESPACE}" \
-        -o jsonpath='{.status.lastAction.action}'
-    assert_output "${action}"
-    run -0 rdd ctl get container "${cid}" --namespace="${RDD_NAMESPACE}" \
-        -o jsonpath='{.status.lastAction.state}'
-    assert_output "${state}"
+        -o jsonpath='{.status.lastAction.action}={.status.lastAction.state}'
+    assert_output "${action}=${state}"
 }
```

## Design Observations

### Strengths

- Collapsing the two SSA field managers (`engine-controller` + `engine-controller-finalizer`) into one after spec went empty is a clean simplification — the old dance was necessary only because controllerName owned `spec.state`, and an SSA release would prune the field and trip CRD required-field validation.
- Pre-checking Docker state for `pause`/`unpause` but not `start`/`stop` is asymmetric but correct: Docker returns 304 on redundant start/stop, but silently succeeds on `unpause` against a non-paused container, and the unpause-on-stopped-container path would otherwise surface as a silent no-op instead of a documented failure. The comment at `docker_watcher.go:383-390` explains this well.
- The `removeActionAnnotation` value-match check (`current != observed` → skip) at `docker_watcher.go:474-477` correctly preserves a concurrent rewrite to a different action. The design documents this as "single annotation holds at most one pending action", which is clean.
- `applyContainer`'s merged create/reassert branch (`sync_containers.go:109-120`) gets simpler precisely because the spec field is no longer subject to "don't clobber the user's state patch" concerns.

### Concerns

- **One-shot annotation loses re-requests of the same action (in-scope)**. If the user writes `action=pause` twice in quick succession (e.g., a jittery UI or a deliberate re-trigger), `removeActionAnnotation` reads `current == observed` and removes — the second write is silently consumed. For idempotent actions (start/stop/pause/unpause) the observable state is the same and the user gets what they asked for, so this is mostly fine. For `restart`, a user hitting the "Restart" button twice to force a second restart after the first completed wouldn't get it. The PR description frames this as intentional (see acknowledged limitation below); it's worth documenting explicitly in `api_containers.md` so GUI implementers know the semantics.

- **Reconciler iterates the full Container list on every reconcile (in-scope, pre-existing)**. `reconcileContainerActions` lists all containers in the namespace on every reconcile, including every Container/Image/Volume watch event. With N containers and M events/sec, cost is O(N·M). A watch predicate filtering to containers with `AnnotationAction` present would bring this to O(touched). The existing comment at `engine_reconciler.go:186-195` already captures this as the long-term plan, so this is just a reminder that the action flow doubles the per-reconcile list count (`reconcileContainerActions` + `processFinalizers` each list all containers).

## Testing Assessment

The five new action tests cover happy paths (start, stop, pause+unpause toggle, restart) plus two failure paths (pause-on-stopped, unpause-on-stopped) and the webhook-reject path for invalid action values. The pre-check logic on pause/unpause is exercised. Untested, ranked by risk:

1. **The `!action.IsValid()` reconciler fallback at `docker_watcher.go:350`**. No way to hit it via BATS unless webhook is disabled. A unit test against a fake client would close this.
2. **Webhook rejection on `Create`** (conditional on I1 being applied). If `Create` validation is added, a test that submits a Container with `kubectl create -f` or `rdd ctl apply -f` carrying a bogus annotation would verify symmetry.
3. **Concurrent rewrite during dispatch**. A test that annotates `action=stop`, waits 200ms, annotates `action=start`, and asserts the final state is `running` would exercise the `observed != current` skip path in `removeActionAnnotation`. Currently only the single-action path is covered.
4. **Action while container mirror is mid-deletion**. A test that deletes a Container mirror (triggering the finalizer) while the reconciler is processing an action against it would exercise the `DeletionTimestamp != nil` skip in `reconcileContainerActions:367` and the `NotFound` handling I2 targets.
5. **Action on a restarting container**. `containerRunState` snapshots Docker's live state; a restart-in-progress (running=false momentarily) would flip an unpause request to `is not running` failure. Not necessarily wrong, but worth documenting if it matters for the GUI.

## Documentation Assessment

`api_containers.md` is rewritten cleanly around the action model, with a Succeeded and a Failed `lastAction` example. A few gaps:

- The doc example at `api_containers.md:155-159` shows a `Succeeded` `lastAction` without an `error` field, and the `Failed` example at 255-262 shows it with one. That's accurate to the code (`ContainerLastAction.Error` is `+optional`), but readers who skim only one example might miss the correlation. A one-line note ("`error` is set only when `state: Failed`") would help.
- `api_containers.md:246-248` — see S3, "harmless double-restart" is too optimistic.
- Missing note: `lastAction` persists across the next reconcile of the container spec/status. That's the design (it's documented as "persists until the next action overwrites it"), but nothing documents that `lastAction` survives a `docker restart` of the underlying container, a `docker stop` + `docker start`, etc. — SSA doesn't clear it because the sync doesn't declare ownership. Worth a sentence so GUI implementers don't build state machines that assume `lastAction` is cleared on observable state changes.
- `AnnotationAction` godoc on `container_types.go:29-33` is good. `ContainerLastAction`, `ContainerAction`, `ContainerActionState` all have clean doc comments. `IsValid`'s godoc ("reports whether a is one of the defined ContainerAction values") follows Go conventions.

## Commit Structure

One commit, `6251b90 engine: replace Container.spec.state with action annotations`. Message accurately describes the change, motivates it, covers the edge cases (crash recovery, duplicate-dispatch pre-check, webhook bypass). Scope is a single conceptual unit — API type change, reconciler change, webhook change, CRD regeneration, design doc, tests. No history cleanup needed.

PR stacks on #309 and is a prerequisite for #315, which the description flags explicitly.

## Acknowledged Limitations

- **"Don't plan to address findings in stacked PR #309"** — `jandubois` general comment. The stacked-PR context means #309-specific findings are out of scope here; my review is strictly for #316's diff (which I've confirmed against `git blame`).
- **"No CLI support for setting container actions"** — `jandubois` general comment. The action API is UI-facing by design; CLI users are directed to `docker`/`nerdctl` directly. This is documented in `api_containers.md:264-266` and in the PR description. I agree — the mirror is the GUI's view, not a CLI primitive.
- **`reconcileContainerActions`/`processFinalizers` list all resources per reconcile** — comment at `engine_reconciler.go:186-195`. The long-term plan is per-resource reconcilers with watch predicates. This PR's new `reconcileContainerActions` inherits the same shape (full list per reconcile). Noted above as a Design Concern but already acknowledged.
- **"A single annotation holds at most one pending action; writing a new value replaces the old"** — PR description and `api_containers.md:237-240`. This intentionally drops same-value re-requests (see Design Concerns above) — the design is "latest wins", no queue, no history.

## Unresolved Feedback

None. The PR has one general author comment (dismissing CLI support for the action API) and no thread comments on the commit under review.

## Coverage Summary

| File | Status |
|---|---|
| `bats/tests/32-app-controllers/engine-docker.bats` | S5 (+ tests listed in Testing Assessment) |
| `docs/design/api_containers.md` | S3 (+ notes in Documentation Assessment) |
| `pkg/apis/containers/v1alpha1/applyconfiguration/containers/v1alpha1/container.go` | Reviewed, no issues — generated, `WithSpec` signature change safe because no callers. |
| `pkg/apis/containers/v1alpha1/applyconfiguration/containers/v1alpha1/containerlastaction.go` | Reviewed, no issues — generated. |
| `pkg/apis/containers/v1alpha1/applyconfiguration/containers/v1alpha1/containerspec.go` | Trivial — deleted, consistent with empty `ContainerSpec`. |
| `pkg/apis/containers/v1alpha1/applyconfiguration/containers/v1alpha1/containerstatus.go` | Reviewed, no issues — generated, adds `WithLastAction`. |
| `pkg/apis/containers/v1alpha1/applyconfiguration/utils.go` | Trivial — generated kind registration. |
| `pkg/apis/containers/v1alpha1/container_types.go` | S1, S4 |
| `pkg/apis/containers/v1alpha1/zz_generated.deepcopy.go` | Reviewed, no issues — generated. |
| `pkg/controllers/app/engine/controllers/docker_watcher.go` | I2, S2, S3 |
| `pkg/controllers/app/engine/controllers/engine_reconciler.go` | Reviewed, no issues — rename `reconcileContainerSpecs` → `reconcileContainerActions` and removal of `finalizerFieldOwner` are both consequences of the spec change, cleanly done. |
| `pkg/controllers/app/engine/controllers/sync_containers.go` | Reviewed, no issues — merged create/reassert branch correct, `ForceOwnership` safe because spec is empty. |
| `pkg/controllers/containers/container/container_webhooks.go` | I1 |
| `pkg/controllers/containers/container/crd.yaml` | Reviewed, no issues — regenerated from types, schema matches Go types. |
| `pkg/controllers/mock/container_reconciler.go` | Reviewed, no issues — correctly drops `WithSpec`/state mapping, mock output is unchanged from consumer perspective. |
---

### [Codex] — Pass 1

### Executive Summary

This change correctly moves container control from a level-triggered spec field to edge-triggered action annotations, and the controller/docs/test updates are mostly aligned with that model. I found one important API gap: the new action validation only runs on updates, while this PR also makes bare `Container` creates easy enough that a create can now bypass both validation and the intended `ContainerCreateRequest` flow; I also found one semantic edge case in the annotation cleanup path where reissuing the same action while the first request is still in flight is silently dropped.

### Findings

**Critical Issues**

None.

**Important Issues**

I1. **Create requests bypass the new action-validation path** — `pkg/controllers/containers/container/container_controller.go:84-91` (important, regression)

```go
mutatingConfig := base.WebhookConfig[*v1alpha1.Container]{
	Name:        "container-mutating",
	WebhookName: "container-mutating.containers.rancherdesktop.io",
	WebhookPort: c.webhookPort,
	Operations: []admissionregistrationv1.OperationType{
		admissionregistrationv1.Update,
	},
	Validator: &immutableValidator{},
}
```

The PR moves the control surface onto `metadata.annotations` and makes `.spec` optional (`pkg/apis/containers/v1alpha1/container_types.go:233-238`), and the CRD no longer requires `spec` at the top level (`pkg/controllers/containers/container/crd.yaml:270-275` and `pkg/controllers/containers/container/crd.yaml:493-500`). At that point, `immutableValidator.ValidateUpdate()` becomes the only place that rejects bad action values (`pkg/controllers/containers/container/container_webhooks.go:37-44`), but the webhook is still registered for `Update` only at `pkg/controllers/containers/container/container_controller.go:88-90`. A client can now create a bare `Container` with `containers.rancherdesktop.io/action=...` directly, bypassing both admission validation and the documented `ContainerCreateRequest` creation path (`docs/design/api_containers.md:105-107` and `docs/design/api_containers.md:179`). `reconcileContainerActions()` at `pkg/controllers/app/engine/controllers/engine_reconciler.go:351-377` will process that object just like a mirror, so a valid action on create is dispatched against whatever container ID was put in `metadata.name`, while an invalid action is accepted and only later scrubbed by `processContainerAction()` at `pkg/controllers/app/engine/controllers/docker_watcher.go:346-352`. The new BATS coverage at `bats/tests/32-app-controllers/engine-docker.bats:366-384` only exercises the update path, so this inconsistency currently ships untested.

Fix: register the validator for `Create` as well as `Update`, and reject direct `Container` creates unless they are engine-owned mirrors. If you want to keep bare create technically possible, at minimum reject any create that carries `containers.rancherdesktop.io/action`. Add an integration test that `rdd ctl apply -f` of a hand-written `Container` with the action annotation fails on create.

**Suggestions**

S1. **Same-value action rewrites are indistinguishable from the old request** — `pkg/controllers/app/engine/controllers/docker_watcher.go:462-480` (suggestion, regression)

```go
current, present := latest.Annotations[containersv1alpha1.AnnotationAction]
if !present || current != observed {
	return nil
}
patch := client.MergeFromWithOptions(latest.DeepCopy(), client.MergeFromWithOptimisticLock{})
delete(latest.Annotations, containersv1alpha1.AnnotationAction)
return w.k8s.Patch(ctx, &latest, patch)
```

`processContainerAction()` captures only the raw action string at `pkg/controllers/app/engine/controllers/docker_watcher.go:337-355`, and `removeActionAnnotation()` at `pkg/controllers/app/engine/controllers/docker_watcher.go:474-480` uses that string as its only freshness check. That preserves a concurrent replacement when the value changes, but it cannot distinguish “the same request I just processed” from “a new request that rewrote the same action value”. The contract in `docs/design/api_containers.md:237-240` says a new write replaces the pending action, but with the current implementation a second `restart` written while the first `restart` is still in flight is silently deleted during cleanup instead of being left pending for the next reconcile.

Fix: give each action request an identity separate from the action name, for example by encoding `{action, nonce}` in the annotation payload or by storing a companion request-id annotation. Copy that request id into `status.lastAction`, and make `removeActionAnnotation()` compare/remove by request id rather than by the action string alone.

### Design Observations

**Strengths**

The shift from `spec.state` to one-shot action annotations is the right abstraction for a Docker-backed mirror. It removes the level-triggered fight with Docker restart policy and direct CLI writes without inventing a durable action queue.

Recording `lastAction` separately from live container state is also a good split. It lets the GUI reason about “what just happened” without overloading `status.status` with transient intent.

### Testing Assessment

1. The suite does not cover the create path for action validation. `bats/tests/32-app-controllers/engine-docker.bats:366-384` proves only that invalid annotations are rejected on update, not that a hand-written `Container` create is rejected.
2. There is no coverage for the same-value replacement edge in S1, especially `restart` rewritten while a previous `restart` is still being processed.

### Documentation Assessment

`docs/design/api_containers.md` is much clearer after this rewrite, but it now describes `Container` as a status-only mirror and `ContainerCreateRequest` as the creation mechanism while the API still accepts direct `Container` creates unless I1 is fixed.

### Acknowledged Limitations

- **Code comment**: `pkg/controllers/app/engine/controllers/engine_reconciler.go:186-190` says “cost is O(N) per child event” and that the long-term fix is per-resource reconcilers with watch predicates. This change keeps that tradeoff but does not materially worsen it.
- **Code comment**: `docs/design/api_containers.md:242-248` documents that a crash after the Docker call can replay the action and may double-restart on `restart`. The implementation matches that documented limitation.
- **PR comment**: The author explicitly declined CLI support for container actions, with the rationale that this API exists for the GUI and terminal users should use `docker`/`nerdctl` directly. This PR remains consistent with that rationale.

### Coverage Summary

- `bats/tests/32-app-controllers/engine-docker.bats` — I1
- `docs/design/api_containers.md` — I1, S1
- `pkg/apis/containers/v1alpha1/applyconfiguration/containers/v1alpha1/container.go` — Reviewed, no issues
- `pkg/apis/containers/v1alpha1/applyconfiguration/containers/v1alpha1/containerlastaction.go` — Reviewed, no issues
- `pkg/apis/containers/v1alpha1/applyconfiguration/containers/v1alpha1/containerspec.go` — Trivial
- `pkg/apis/containers/v1alpha1/applyconfiguration/containers/v1alpha1/containerstatus.go` — Reviewed, no issues
- `pkg/apis/containers/v1alpha1/applyconfiguration/utils.go` — Reviewed, no issues
- `pkg/apis/containers/v1alpha1/container_types.go` — I1
- `pkg/apis/containers/v1alpha1/zz_generated.deepcopy.go` — Reviewed, no issues
- `pkg/controllers/app/engine/controllers/docker_watcher.go` — I1, S1
- `pkg/controllers/app/engine/controllers/engine_reconciler.go` — I1
- `pkg/controllers/app/engine/controllers/sync_containers.go` — Reviewed, no issues
- `pkg/controllers/containers/container/container_webhooks.go` — I1
- `pkg/controllers/containers/container/crd.yaml` — I1
- `pkg/controllers/mock/container_reconciler.go` — Reviewed, no issues

---

### [Gemini] — Pass 1

**Executive Summary**

This PR replaces the level-triggered `Container.spec.state` with a one-shot `containers.rancherdesktop.io/action` annotation to drive container lifecycle actions (start, stop, pause, unpause, restart). The engine reconciler processes the annotation, dispatches the request to the Docker daemon, records the outcome in a new `.status.lastAction` field, and clears the annotation. The design elegantly avoids race conditions with Docker's internal restart policies and concurrent CLI users by shifting the API to a status-only mirror with edge-triggered intent. 

### Findings

#### Critical Issues
None.

#### Important Issues

**I1. Read-after-write cache staleness causes 409 Conflict loops and action replay** — `pkg/controllers/app/engine/controllers/docker_watcher.go:343-349` (important, regression)

```go
	if err := w.patchContainerLastAction(ctx, c.Name, lastAction); err != nil {
		return fmt.Errorf("failed to patch lastAction for %s: %w", c.Name, err)
	}
	if err := w.removeActionAnnotation(ctx, c.Name, raw); err != nil {
		return fmt.Errorf("failed to remove action annotation for %s: %w", c.Name, err)
	}
```

When `patchContainerLastAction` successfully updates the `.status` subresource, the API server increments the container's `ResourceVersion`. Immediately after, `removeActionAnnotation` is called to clear the annotation. Because its internal retry loop starts with `w.k8s.Get` using the cached client, it will almost certainly read the stale object (pre-status-update) because the watch event has not yet reached the informer. 

The subsequent JSON merge patch uses `client.MergeFromWithOptimisticLock{}`, which attaches the stale `ResourceVersion` to the payload, predictably resulting in a `409 Conflict`. If the cache does not sync within the ~310ms budget of `retry.DefaultRetry`, the function fails and returns the error. This causes `processContainerAction` to requeue, and the entire Docker action is re-executed on the next reconcile pass. 

Fix: Refactor `patchContainerLastAction` to return the updated `Container` object (since `w.k8s.Status().Patch` updates the passed object in-place with the new `ResourceVersion`). Pass this fresh object to `removeActionAnnotation` so the first patch attempt succeeds without a cached read.

```diff
-func (w *dockerWatcher) patchContainerLastAction(ctx context.Context, id string, lastAction *containersv1alpha1.ContainerLastAction) error {
+func (w *dockerWatcher) patchContainerLastAction(ctx context.Context, id string, lastAction *containersv1alpha1.ContainerLastAction) (*containersv1alpha1.Container, error) {
 	key := client.ObjectKey{Name: id, Namespace: w.apiNamespace}
+	var latest containersv1alpha1.Container
-	return retry.RetryOnConflict(retry.DefaultRetry, func() error {
-		var latest containersv1alpha1.Container
+	err := retry.RetryOnConflict(retry.DefaultRetry, func() error {
 		if err := w.k8s.Get(ctx, key, &latest); err != nil {
 			return err
 		}
 		patch := client.MergeFromWithOptions(latest.DeepCopy(), client.MergeFromWithOptimisticLock{})
 		latest.Status.LastAction = lastAction
 		return w.k8s.Status().Patch(ctx, &latest, patch)
 	})
+	return &latest, err
 }

-func (w *dockerWatcher) removeActionAnnotation(ctx context.Context, id, observed string) error {
-	key := client.ObjectKey{Name: id, Namespace: w.apiNamespace}
+func (w *dockerWatcher) removeActionAnnotation(ctx context.Context, latest *containersv1alpha1.Container, observed string) error {
 	return retry.RetryOnConflict(retry.DefaultRetry, func() error {
-		var latest containersv1alpha1.Container
-		if err := w.k8s.Get(ctx, key, &latest); err != nil {
-			return err
-		}
 		current, present := latest.Annotations[containersv1alpha1.AnnotationAction]
 		if !present || current != observed {
 			return nil
 		}
 		patch := client.MergeFromWithOptions(latest.DeepCopy(), client.MergeFromWithOptimisticLock{})
 		delete(latest.Annotations, containersv1alpha1.AnnotationAction)
-		return w.k8s.Patch(ctx, &latest, patch)
+		err := w.k8s.Patch(ctx, latest, patch)
+		if apierrors.IsConflict(err) {
+			// Fetch fresh object from cache/API before the next retry
+			if getErr := w.k8s.Get(ctx, client.ObjectKeyFromObject(latest), latest); getErr != nil {
+				return getErr
+			}
+		}
+		return err
 	})
 }
```
*(Update `processContainerAction` to capture `latest, err := w.patchContainerLastAction(...)` and pass `latest` down to `removeActionAnnotation`.)*

#### Suggestions
None.

### Design Observations

#### Strengths
- **Implicit "In Progress" State**: Using the presence of the one-shot annotation as the implied "in-progress" indicator is an elegant design choice. It gives UI clients a clear signal that the action is executing without requiring a third, latency-inducing API write (e.g., `state=InProgress`) before dispatching the Docker call.
- **Concurrent Request Safety**: `removeActionAnnotation` checking `current != observed` before deletion flawlessly prevents dropping a user's concurrent action request (e.g., an impatient user switching a `pause` to `stop` mid-flight). 
- **Graceful Webhook Bypasses**: The `processContainerAction` safety net explicitly drops invalid action values (`!action.IsValid()`). This perfectly guards against invalid values that bypass the `Update` webhook when it's temporarily offline or if an invalid annotation is injected during a `Create` request.

### Testing Assessment
1. The BATS tests thoroughly verify the end-to-end flow, correctly validating the new `.status.lastAction` fields and failure paths (e.g., Docker refusing to pause a stopped container). 
2. The idempotency checks implicitly validate that `moby/client` surfaces a 304 Not Modified as a non-error for `start` and `stop`.

### Documentation Assessment
The `docs/design/api_containers.md` update accurately models the edge-triggered action approach, clarifying the idempotency guarantees and error surfacing mechanisms.

### Acknowledged Limitations
- **Code comment**: "Restart has no target state to match, so a replay after a crash runs a second restart; the double restart is harmless." (`docs/design/api_containers.md:245`)
  - **PR comment**: The author explicitly documented that a controller crash mid-flight during a `restart` action will leave the annotation in place, resulting in a second (harmless) restart on the subsequent reconcile loop. This remains an acknowledged boundary condition of the stateless reconciler design.

### Coverage Summary
- `bats/tests/32-app-controllers/engine-docker.bats` — Reviewed, no issues (prompt test annotation glitch noted as a rendering artifact).
- `docs/design/api_containers.md` — Reviewed, no issues.
- `pkg/apis/containers/v1alpha1/applyconfiguration/containers/v1alpha1/container.go` — Reviewed, no issues.
- `pkg/apis/containers/v1alpha1/applyconfiguration/containers/v1alpha1/containerlastaction.go` — Reviewed, no issues.
- `pkg/apis/containers/v1alpha1/applyconfiguration/containers/v1alpha1/containerspec.go` — Trivial.
- `pkg/apis/containers/v1alpha1/applyconfiguration/containers/v1alpha1/containerstatus.go` — Reviewed, no issues.
- `pkg/apis/containers/v1alpha1/applyconfiguration/utils.go` — Trivial.
- `pkg/apis/containers/v1alpha1/container_types.go` — Reviewed, no issues.
- `pkg/apis/containers/v1alpha1/zz_generated.deepcopy.go` — Reviewed, no issues.
- `pkg/controllers/app/engine/controllers/docker_watcher.go` — **I1**
- `pkg/controllers/app/engine/controllers/engine_reconciler.go` — Reviewed, no issues.
- `pkg/controllers/app/engine/controllers/sync_containers.go` — Reviewed, no issues.
- `pkg/controllers/containers/container/container_webhooks.go` — Reviewed, no issues.
- `pkg/controllers/containers/container/crd.yaml` — Reviewed, no issues.
- `pkg/controllers/mock/container_reconciler.go` — Reviewed, no issues.
