# Deep Review: 20260422-110156-pr-323

| | |
|---|---|
| **Date** | 2026-04-22 11:01 |
| **Repo** | [rancher-sandbox/rancher-desktop-daemon](https://github.com/rancher-sandbox/rancher-desktop-daemon) |
| **Round** | 1 |
| **Author** | [@mook-as](https://github.com/mook-as) |
| **PR** | [#323](https://github.com/rancher-sandbox/rancher-desktop-daemon/pull/323) — implement log passthrough |
| **Commits** | `71b57d2` Address (AI) review comments<br>`78e9315` implement log passthrough |
| **Reviewers** | Claude Opus 4.7 (effort: xhigh), Codex GPT 5.4 (effort: high), Gemini 3.1 Pro (effort: default) |
| **Verdict** | **Merge with fixes** — one genuine data race on `r.apiNamespace` is worth addressing, plus a handful of clarity and test-coverage cleanups |
| **Wall-clock time** | `38 min 3 s` |

---

## Executive Summary

PR #323 adds a WebSocket log-passthrough endpoint (`/passthrough/engine/logs/{container}`) for the Docker backend, wiring it through the engine controller's `HandleLogs` handler and exposing the engine's shared-mux entry via `GetPassthroughEndpoints`. The second commit addresses earlier AI-review feedback: binary frame encoding, pre-upgrade TTY detection via `ContainerInspect`, websocket ping/pong keep-alive, `tail`/`follow` query parameters, watcher-mutex access, and late handler binding to tolerate startup ordering.

The implementation is structurally sound and the review consensus is positive. No critical defects surfaced. One **important** issue: the new HTTP handler reads `r.apiNamespace` from arbitrary goroutines without synchronization, violating the struct's documented single-reconciler invariant. The rest are suggestions — mainly doc drift (API contract promises UTF-8 text while the wire carries raw binary; "full container ID" claimed but any hex string is accepted), untested negative paths for the new endpoint, and a mirror-creation race in the new BATS tests that will surface as a CI flake.

Structure: 1 important (data race), 11 suggestions (including 2 merged from multiple agents), 1 design concern (blank-import gap), several acknowledged limitations tied to pending containerd work.

---

## Critical Issues

None.

---

## Important Issues

I1. **Unsynchronized read of `r.apiNamespace` in `HandleLogs`** — `pkg/controllers/app/engine/controllers/container_passthrough.go:70-74` [Claude Opus 4.7]

```go
var c containersv1alpha1.Container
err := r.Client.Get(req.Context(), types.NamespacedName{
    Namespace: r.apiNamespace,
    Name:      containerID,
}, &c)
```

`r.apiNamespace` is written by `Reconcile` at `pkg/controllers/app/engine/controllers/engine_reconciler.go:165` with no lock, and is now read here from HTTP-handler goroutines with no lock. `go test -race` will flag it, and a weak-memory-model architecture could produce a torn or empty read. In practice `apiNamespace` is effectively immutable after the first reconcile (per the field doc at `engine_reconciler.go:122-124`), so today's worst observable outcome is a briefly-empty-string read that yields a spurious 404 before the first Reconcile has run — but the race is real.

The struct-level comment at `engine_reconciler.go:113-115` claims "Only Reconcile and the manager's shutdown-hook goroutine... contend for the fields below." That invariant is now false — `HandleLogs` is a third concurrent accessor. The field comment at line 126-127 was simultaneously rewritten from a precise description of what `watcherMu` guards against to a generic note about hold duration, so the struct's concurrency contract is now less documented than before the change (see S1).

Fix: acquire `watcherMu` once at entry and read `watcher` and `apiNamespace` under it; update the struct-level comment and the `watcherMu` field comment to name `HandleLogs` as a concurrent reader. Reordering also lets callers see the more-specific 503 before the 404, which helps diagnosis (see S6).

```diff
-	var c containersv1alpha1.Container
-	err := r.Client.Get(req.Context(), types.NamespacedName{
-		Namespace: r.apiNamespace,
-		Name:      containerID,
-	}, &c)
+	r.watcherMu.Lock()
+	watcher := r.watcher
+	apiNamespace := r.apiNamespace
+	r.watcherMu.Unlock()
+	if watcher == nil {
+		log.V(5).Info("Docker watcher not running")
+		http.Error(w, "Docker watcher not running", http.StatusServiceUnavailable)
+		return
+	}
+
+	var c containersv1alpha1.Container
+	err := r.Client.Get(req.Context(), types.NamespacedName{
+		Namespace: apiNamespace,
+		Name:      containerID,
+	}, &c)
```

`(important, regression)`

---

## Suggestions

S1. **`watcherMu` field comment and `EngineReconciler` struct doc went stale** — `pkg/controllers/app/engine/controllers/engine_reconciler.go:113-127` [Claude Opus 4.7]

```go
// The App is a cluster-scoped singleton, so controller-runtime runs at
// most one Reconcile at a time. Only Reconcile and the manager's
// shutdown-hook goroutine (see SetupWithManager) contend for the
// fields below.
type EngineReconciler struct {
    ...
    // watcherMu guards r.watcher; note that this may be held for a long time
    // during initialization.
    watcherMu sync.Mutex
    watcher   engine
```

Pre-PR the field comment read: "watcherMu guards r.watcher against the shutdown-hook goroutine's stopWatcher call. Reconcile runs serially (see struct doc), so the lock has no other role." The rewrite dropped the writer-identification (the shutdown hook) and the serial-Reconcile note in exchange for a hold-time remark that was equally true of the old code. After this PR, `HandleLogs` is a third concurrent reader, so the comment should grow more specific, not less.

Fix: enumerate the three concurrent participants (Reconcile, shutdown-hook goroutine, `HandleLogs`) and state which fields each one touches. Update the struct doc at lines 113-115 in the same pass.

`(suggestion, regression)`

S2. **Docs promise UTF-8 message text, but the endpoint streams raw binary** — `docs/design/api_containers.md:262-266` [Codex GPT 5.4]

```md
An endpoint at `/passthrough/.../logs/${container}` will speak WebSocket;
messages are one way, as stream of bytes; messages should not be buffered.
Message text must be UTF-8 encoded.  The last portion of the path must be the
full container ID.
```

The doc mixes two incompatible contracts: "stream of bytes" plus per-message UTF-8. The implementation provides only the former. `websocketWriter.Write` at `pkg/controllers/app/engine/controllers/container_passthrough.go:45-56` emits one `websocket.BinaryMessage` per raw `Write`, fed by `io.Copy` or `stdcopy.StdCopy` at lines 224-227. Those writers can split anywhere, so a frame can end mid-rune or contain arbitrary non-text bytes from the container stream. A client that codes to "Message text must be UTF-8 encoded" will build wrong framing assumptions.

Fix: document this as an opaque binary stream with no frame-to-line or frame-to-UTF-8 guarantee. The exec endpoint (line 276-278) references "unbuffered binary as in the logs endpoint", so align the wording there once logs is fixed. The sentence also reads ungrammatically — "as stream of bytes" wants an article — worth polishing while the doc is open.

```diff
-An endpoint at `/passthrough/.../logs/${container}` will speak WebSocket;
-messages are one way, as stream of bytes; messages should not be buffered.
-Message text must be UTF-8 encoded.  The last portion of the path must be the
-full container ID.
+An endpoint at `/passthrough/.../logs/${container}` will speak WebSocket.
+Messages are one way and binary; frame boundaries are transport artifacts and
+do not correspond to log lines or UTF-8 rune boundaries. The last portion of
+the path must be the full container ID.
```

`(suggestion, regression)`

S3. **Container ID validator is too permissive: accepts any-length and uppercase hex** — `pkg/controllers/app/engine/controllers/container_passthrough.go:37-38` [Codex GPT 5.4] [Gemini 3.1 Pro]

```go
var containerIDValidator = regexp.MustCompile(`^[0-9a-fA-F]+$`)
```

Two independent observations converge on the same regex:

The API doc at `docs/design/api_containers.md:265-266` says "the full container ID" is required, but the validator accepts any hex prefix. `/passthrough/engine/logs/deadbeef` passes the 400 gate and hits the mirror lookup at lines 71-74, returning 404 and making client bugs harder to diagnose.

The validator also accepts uppercase A–F. A Kubernetes object name must be a DNS-1123 subdomain (lowercase only), so `r.Client.Get` rejects `DEADBEEF...` as `Invalid`. That error is neither `IsNotFound` nor `IsInvalidArgument` as the handler understands them, so it falls to the `default` branch and returns 500 instead of 400.

Fix: tighten the regex to the canonical Docker ID shape, which is 64 lowercase hex characters:

```diff
-var containerIDValidator = regexp.MustCompile(`^[0-9a-fA-F]+$`)
+var containerIDValidator = regexp.MustCompile(`^[0-9a-f]{64}$`)
```

Or relax the doc to say prefixes are accepted but only full mirrored IDs resolve.

`(suggestion, regression)`

S4. **`websocketWriter.Write` leaks a half-open writer when the inner `Write` fails** — `pkg/controllers/app/engine/controllers/container_passthrough.go:45-56` [Claude Opus 4.7]

```go
func (w *websocketWriter) Write(p []byte) (int, error) {
    _ = w.conn.SetWriteDeadline(time.Now().Add(writeTimeout))
    writer, err := w.conn.NextWriter(websocket.BinaryMessage)
    if err != nil {
        return 0, err
    }
    n, err := writer.Write(p)
    if err == nil {
        err = writer.Close()
    }
    return n, err
}
```

When `writer.Write(p)` errors, the per-message writer is not closed. Gorilla's `beginMessage` then rejects the next `NextWriter` call with `"websocket: concurrent write to websocket connection"` until the connection is torn down. Today this is benign: `stdcopy.StdCopy` and `io.Copy` both stop on the first write error, so no next call happens. If the writer is reused outside `stdcopy` (or a future caller retries), the leak turns subsequent writes into a cryptic error.

Fix: always close the writer and surface the first non-nil error.

```diff
 	n, err := writer.Write(p)
-	if err == nil {
-		err = writer.Close()
-	}
-	return n, err
+	if closeErr := writer.Close(); err == nil {
+		err = closeErr
+	}
+	return n, err
```

`(suggestion, regression)`

S5. **`HandleLogs` doc comment omits the `follow` query parameter** — `pkg/controllers/app/engine/controllers/container_passthrough.go:58-60` [Claude Opus 4.7]

```go
// HandleLogs implements the log endpoint to pass through container logs.
// It is at `/passthrough/engine/logs/{container}[?tail=999]`.
func (r *EngineReconciler) HandleLogs(w http.ResponseWriter, req *http.Request) {
```

`follow` is parsed at lines 89-97 but absent from the godoc. `docs/design/api_containers.md:273` lists both parameters, so this is the only place a Go caller would see a stale contract.

Fix: extend the doc to list both parameters, or point at the design doc.

`(suggestion, regression)`

S6. **`GetPassthroughEndpoints` sort differs between engine and mock/demo controllers** — `pkg/controllers/app/engine/controller.go:73-75` [Claude Opus 4.7]

```go
func (c *controller) GetPassthroughEndpoints() []string {
    return slices.Sorted(maps.Keys(c.passthroughs))
}
```

The engine returns a sorted slice; `pkg/controllers/mock/mock_controller.go:81` and `pkg/controllers/app/demo/controller.go:77` return `slices.Collect` output with no ordering. The interface doc at `pkg/controllers/base/controller.go:64-70` leaves order undefined, so nothing breaks, but discovery ConfigMap payloads and log lines flicker for controllers with multiple endpoints. Pick one convention and apply it everywhere.

Fix: switch mock and demo to `slices.Sorted` for stable `rdd ctl get configmap/rdd-controller-manager` diffs, or document the interface as order-undefined and accept flicker.

`(suggestion, gap)`

S7. **"Docker watcher not running" is misleading on the containerd backend** — `pkg/controllers/app/engine/controllers/container_passthrough.go:105-108` [Claude Opus 4.7]

```go
if watcher == nil {
    log.V(5).Info("Docker watcher not running")
    http.Error(w, "Docker watcher not running", http.StatusServiceUnavailable)
    return
}
```

`watcher == nil` covers two very different conditions: App stopped, and `App.spec.containerEngine.name == "containerd"` (see `engine_reconciler.go:197` where `wantWatcher = running && engineIsDocker`). A GUI user on the containerd backend sees a transient and misleading "not running" message. Containerd support is explicitly planned (comment on `engineMoby` at `engine_reconciler.go:51-54`).

Fix: distinguish the two cases. Read `App.Spec.ContainerEngine.Name` inline, or stash a small reason when the reconciler tears the watcher down, and return e.g. "log passthrough is not supported on the containerd backend" with `501 Not Implemented`.

`(suggestion, gap)`

S8. **New BATS tests issue log requests before waiting for the Container mirror** — `bats/tests/32-app-controllers/engine-docker.bats:185-188` [Claude Opus 4.7]

```bash
run_e -0 docker run --detach --tty --name test-logs-tty busybox /bin/sh -c "${command}"
container=${output}

do_websocket "/passthrough/engine/logs/${container}"
```

Every other new test waits for the mirror with `rdd ctl wait --for=create container/${cid}` before touching it (see "docker run creates Container resource" at line 140-143). The three logs tests skip that wait. On a slow VM the Docker-event-to-mirror path can trail the Docker API by long enough for `r.Client.Get` to return NotFound, producing a 404 and a failed assertion. The `sleep 1` inside the busybox command does not help because the test issues the request before the sleep runs.

Fix: prefix each `do_websocket` call with `rdd ctl wait --for=create --namespace="${RDD_NAMESPACE}" container/"${container}" --timeout=30s`.

`(suggestion, gap)`

S9. **Negative paths for the new endpoint are untested** — `pkg/controllers/app/engine/controllers/container_passthrough.go:89-97` [Claude Opus 4.7] [Codex GPT 5.4]

```go
if f := req.URL.Query().Get("follow"); f != "" {
    if follow, err := strconv.ParseBool(f); err == nil {
        opts = append(opts, engineLogWithFollow(follow))
    } else {
        log.V(5).Info("Invalid follow parameter", "follow", f, "error", err)
        http.Error(w, "Invalid follow parameter", http.StatusBadRequest)
        return
    }
}
```

The 400 branches (invalid `follow`, non-hex container ID) and the 404 branch (unknown container ID) have no test coverage in `bats/tests/32-app-controllers/engine-docker.bats`. The mock passthrough suite at `bats/tests/34-containers-controllers/containers-mock.bats:75-79` already exercises a similar pattern, so one curl assertion each lands cheaply and guards against status-code regressions.

Fix: add three assertions — invalid `follow=maybe`, non-hex container ID, and hex-but-unknown ID.

`(suggestion, gap)`

S10. **`conn.SetReadLimit(4096)` is a magic number** — `pkg/controllers/app/engine/controllers/container_passthrough.go:156` [Claude Opus 4.7]

```go
defer conn.Close()
conn.SetReadLimit(4096)
```

The other five timing-related literals at the top of the file (`controlTimeout`, `writeTimeout`, `pingInterval`, `pingTimeout`) are named. This one deserves the same treatment, both for consistency and because the rationale — clients are not expected to send anything large; 4096 covers PONGs and any future client-to-server protocol — is not obvious at the call site.

Fix: lift to a named constant alongside the others, with a short comment on the rationale.

`(suggestion, regression)`

S11. **`MarkControlPlaneReady` fires before controllers start, and silently proceeds on discovery-registration failure** — `pkg/service/controllers/manager.go:193-203` [Gemini 3.1 Pro]

```go
// Register controller manager in cluster for service discovery before
// running the controllers.
if err := scm.registerDiscovery(ctx); err != nil {
    log.Error(err, "Failed to register controller manager for discovery")
    // Don't fail startup for discovery registration errors
}

// Mark the control plane ready after registerDiscovery so clients
// waiting on the ready annotation see both CRDs installed and
// controller registration written, not just CRDs.
if err := MarkControlPlaneReady(ctx, scm.discovery.client); err != nil {
```

The annotation advertises readiness at line 201, but the passthrough server (added as a `manager.Runnable` at line 215) does not start listening until `mgr.Start(ctx)` at line 284. A client that sees the ready annotation and races to connect hits connection-refused on the new passthrough port. Discovery-registration failure is also swallowed, so the annotation may fire over an empty or partial discovery ConfigMap.

The PR does not modify this block, but the new passthrough endpoint is precisely the new surface that exposes the readiness race — the `handler == nil` defensive check the PR added at the newly-modified lines 479-486 acknowledges the same ordering concern at a finer grain. Worth addressing together rather than layering more workarounds.

Fix: move `MarkControlPlaneReady` into a post-`mgr.Start` `manager.Runnable`, and only fire it when discovery registration succeeded.

`(suggestion, gap)`

---

## Design Observations

### Strengths

- **Late WebSocket upgrade** — `pkg/controllers/app/engine/controllers/container_passthrough.go:147-154` (in-scope) [Claude Opus 4.7] [Codex GPT 5.4] [Gemini 3.1 Pro]. All three reviewers called this out: the handler defers the upgrade until after the K8s `Get`, Docker `ContainerInspect`, and `ContainerLogs` open succeed. Errors flow back through real HTTP status codes instead of post-handshake WebSocket close frames, which is much easier for clients to handle.
- **Pre-upgrade TTY detection via `ContainerInspect`** — `pkg/controllers/app/engine/controllers/docker_watcher.go:463-470` (in-scope) [Codex GPT 5.4] [Gemini 3.1 Pro]. `Client.ContainerLogs` does not expose media type, so inspecting `Config.Tty` first is the correct way to pick between raw copy and `stdcopy.StdCopy`. Matches Docker's own CLI strategy.
- **Graceful ping/pong keep-alive** — `pkg/controllers/app/engine/controllers/container_passthrough.go:176-220` (in-scope) [Gemini 3.1 Pro]. Correct use of Go 1.23 timer semantics; reset on pong and on successful read; cleanly canceled via the shared context.
- **Late `GetPassthroughHandler` binding in the mux** — `pkg/service/controllers/manager.go:479-486` (in-scope) [Claude Opus 4.7]. Moving the handler lookup inside the `http.HandlerFunc` closure tolerates the startup window when `c.reconciler` is nil and returns 503 instead of panicking.

### Concerns

- **The engine controller is not blank-imported by `cmd/app-controller/main.go` or `cmd/containers-controller/main.go`** (future) [Claude Opus 4.7]. Only `pkg/service/cmd/service.go:59` blank-imports `pkg/controllers/app/engine`. Per the repo's deep-review steering, "both `cmd/*/main.go` and the embedded service package blank-import the controller." Running the engine reconciler as an out-of-tree process would silently drop its passthrough endpoint with no signal to the operator. Pre-existing gap, but more visible now that the passthrough endpoint exists.

---

## Testing Assessment

Integration coverage for the happy paths is reasonable: the new BATS cases exercise TTY and non-TTY log streams, `tail`, and `follow=false`. The gaps are concentrated in three areas:

1. **Mirror-creation race in the new tests** (S8) — the most likely source of CI flakes.
2. **Negative paths untested** (S9) — invalid `follow`, non-hex container ID, hex-but-unknown ID. All reachable; none asserted.
3. **Ping/pong keep-alive uncovered** — tests complete in under a minute, so `pingInterval = 10s` / `pingTimeout = 60s` never fire. Not a regression, but the keep-alive is the most intricate new code.
4. **Stderr demux on the non-TTY path** — the `stdcopy.StdCopy` branch at `container_passthrough.go:227` is exercised indirectly; no test asserts stderr preservation.
5. **Log request on the containerd backend** — no test for the 503 path (which today carries the misleading message from S7).

`make build-rdd` and `go test ./pkg/controllers/app/engine/... ./pkg/service/controllers ./pkg/controllers/mock/...` reported green during review (Codex).

---

## Documentation Assessment

- `docs/design/api_containers.md:262-273` describes the new endpoint and its query parameters, but mixes incompatible framing guarantees (binary stream plus per-message UTF-8, see S2) and promises a "full container ID" the handler does not enforce (see S3).
- `pkg/controllers/app/engine/controllers/container_passthrough.go:58-59` — godoc for `HandleLogs` lists `tail` but not `follow` (see S5).
- `pkg/controllers/app/engine/controllers/engine_reconciler.go:113-127` — struct doc and `watcherMu` field comment went stale against the new HTTP-handler concurrency (see I1 and S1).
- Minor prose: line 264 "as stream of bytes" wants "as a stream of bytes"; line 277 "unbuffered binary as in the logs endpoint" is ungrammatical. Easy polish while the doc is open.

---

## Commit Structure

Two commits, cleanly split: the first introduces the minimal implementation; the second addresses prior review feedback with an enumerated changelog. No concerns.

---

## Acknowledged Limitations

- **Containerd backend returns 503 for log requests** — containerd mirroring is still on the roadmap (see `engine_reconciler.go:51-54`). S7 covers the user-facing message today.
- **`Client.ContainerLogs` does not expose media type** — raised by @mook-as in the PR thread. The current `ContainerInspect` + `Config.Tty` approach is the resolution and matches Docker's own CLI behavior.
- **TTY detection via inspect is a point-in-time read** — the container's TTY configuration is immutable after create, so this is not a race the handler can lose.

---

## Unresolved Feedback

None. The three in-thread PR comments either landed as changes in the second commit or are captured under Acknowledged Limitations.

---

## Agent Performance Retro

**[Claude]** — The deepest and most actionable pass: surfaced the only important finding (the `apiNamespace` data race), traced it to the stale struct-level doc and rewritten `watcherMu` field comment, and proposed a fix that also improves diagnostics by reordering the 404/503 branches. Caught two gap-level issues the other agents missed (the BATS mirror-creation race and the engine controller's missing blank-imports). Its suggestions lean toward clarity and test-coverage cleanups, which fits a PR this tight.

**[Codex]** — Focused and precise, but narrower than the others: limited to two doc/contract mismatches and nothing on the implementation itself. Both findings landed — UTF-8 framing and the over-permissive container ID regex — and the latter merged cleanly with Gemini's uppercase-hex observation. Ran `make build-rdd` and the package tests to back its "no merge-blocker in the implementation" verdict, which the consolidation accepted.

**[Gemini]** — Two useful angles the others missed: the pre-existing `MarkControlPlaneReady` readiness race that the new passthrough endpoint exposes, and the uppercase-hex case of the container ID regex. The `MarkControlPlaneReady` finding was labeled important, but the code it points to is not changed in this PR; consolidation downgraded it to a suggestion (S11) framed around the PR's new surface. Hit a sandbox denial trying to write a throwaway test file outside the worktree (`/tmp/testtimer.go`); recovered without impact.

### Summary

| | Claude Opus 4.7 | Codex GPT 5.4 | Gemini 3.1 Pro |
|---|---|---|---|
| Duration | 21m 00s | 5m 38s | 8m 31s |
| Findings | 1I 8S | 2S | 2S |
| Tool calls | 53 (Bash 30, Read 21, Grep 2) | 44 (shell 41, stdin 3) | 32 (run_shell_command 16, grep_search 9, glob 2) |
| Design observations | 2 | 2 | 3 |
| False positives | 0 | 0 | 0 |
| Unique insights | 7 | 1 | 2 |
| Files reviewed | 8 | 8 | 8 |
| Coverage misses | 0 | 0 | 0 |
| **Totals** | **1I 8S** | **2S** | **2S** |
| Downgraded | 0 | 0 | 1 (I→S) |
| Dropped | 0 | 0 | 0 |


Claude delivered the most value: the only important finding, the most unique insights, and the design-level observations that tied findings together. Codex and Gemini each contributed one useful suggestion that the others missed. The three reviews reinforced each other on the strengths (late upgrade, pre-upgrade TTY detection) without redundantly repeating them.

### Reconciliation

Gemini P1 "Pre-existing readiness signal" (important, on unchanged code): important → suggestion S11. Rationale: the code is unchanged by this PR; the PR's new passthrough endpoint is the new surface exposing the pre-existing race, so it belongs as a suggestion tied to the PR rather than a blocker for it.

---

## Review Process Notes

### Repo context updates

- Add a note to `deep-review-context.md` about the engine controller's blank-import gap in `cmd/*/main.go`. Reviewers should check this when a PR adds a new user-visible surface (API endpoint, passthrough handler, webhook) to a controller that is not yet registered in every standalone binary — a missing blank-import silently drops the feature in those binaries with no startup error. Apply when the PR adds a surface whose absence would not surface as a test failure.

---

## Appendix: Original Reviews

### Claude Opus 4.7

## Executive Summary

PR #323 adds a WebSocket log-passthrough endpoint (`/passthrough/engine/logs/{container}`) for the Docker backend, wiring the engine controller into the existing passthrough mux. The second commit addresses earlier AI review feedback (binary frames, TTY detection, ping/pong, tail/follow params, watcher-mutex access, late handler lookup). Review found no critical issues; the main concern is a data race reading `r.apiNamespace` in the new HTTP-serving goroutine, and a handful of clarity/test-coverage suggestions.

## Findings

### Critical Issues

None.

### Important Issues

**I1. Unsynchronized read of `r.apiNamespace` in HandleLogs** — `pkg/controllers/app/engine/controllers/container_passthrough.go:72`

```go
var c containersv1alpha1.Container
err := r.Client.Get(req.Context(), types.NamespacedName{
    Namespace: r.apiNamespace,
    Name:      containerID,
}, &c)
```

`r.apiNamespace` is written by `Reconcile` on line 165 of `engine_reconciler.go` with no lock, and is now read here on line 72 from HTTP-handler goroutines with no lock. This is a concurrent read/write — `go test -race` or production usage on an architecture with a weaker memory model will flag it. In practice the value is effectively immutable after the first reconcile (per the field doc at `engine_reconciler.go:122-124`), so the observable behavior today is at most a briefly-empty-string read that produces a misleading 404 before the first Reconcile has run. But it is still a true data race, and the struct doc at `engine_reconciler.go:113-115` (untouched by this PR) claims "Only Reconcile and the manager's shutdown-hook goroutine contend for the fields below" — HandleLogs now violates that invariant silently.

The field comment on `watcherMu` was simultaneously rewritten (lines 126-127) from a precise description of what mutex protects what to a generic note about lock duration, so the concurrency contract for this struct is now less documented than before the change, not more.

Fix: acquire `watcherMu` once and read both `watcher` and `apiNamespace` under it, and update the struct-level and field-level comments to name HandleLogs as a concurrent reader.

```diff
-	var c containersv1alpha1.Container
-	err := r.Client.Get(req.Context(), types.NamespacedName{
-		Namespace: r.apiNamespace,
-		Name:      containerID,
-	}, &c)
+	r.watcherMu.Lock()
+	watcher := r.watcher
+	apiNamespace := r.apiNamespace
+	r.watcherMu.Unlock()
+	if watcher == nil {
+		log.V(5).Info("Docker watcher not running")
+		http.Error(w, "Docker watcher not running", http.StatusServiceUnavailable)
+		return
+	}
+
+	var c containersv1alpha1.Container
+	err := r.Client.Get(req.Context(), types.NamespacedName{
+		Namespace: apiNamespace,
+		Name:      containerID,
+	}, &c)
...
-	r.watcherMu.Lock()
-	watcher := r.watcher
-	r.watcherMu.Unlock()
-	if watcher == nil {
-		...
-	}
```

Reordering also gives callers the more-specific 503 when the engine is not running, instead of a 404 that's indistinguishable from "your container ID is wrong" (see S5).

`(important, regression)`

### Suggestions

**S1. `watcherMu` field comment lost information about concurrent access** — `pkg/controllers/app/engine/controllers/engine_reconciler.go:126-127`

```go
// watcherMu guards r.watcher; note that this may be held for a long time
// during initialization.
watcherMu sync.Mutex
watcher   engine
```

The pre-PR comment named the specific concurrent writer the mutex protected against (the shutdown-hook goroutine) and stated that Reconcile runs serially. The rewrite drops both facts and adds a hold-time note that is true of the old code too. After this PR, there is a new concurrent reader (HandleLogs from arbitrary HTTP-serving goroutines), so the comment should be *more* specific, not less. Also note the struct-level comment at line 113-115 is now stale — it still claims only Reconcile and the shutdown goroutine contend for these fields.

Fix: rewrite the field comment to enumerate all readers and writers, and update the struct-level comment to mention HandleLogs.

`(suggestion, regression)`

**S2. `websocketWriter.Write` leaves a half-open writer after `writer.Write` fails** — `pkg/controllers/app/engine/controllers/container_passthrough.go:45-56`

```go
func (w *websocketWriter) Write(p []byte) (int, error) {
    _ = w.conn.SetWriteDeadline(time.Now().Add(writeTimeout))
    writer, err := w.conn.NextWriter(websocket.BinaryMessage)
    if err != nil {
        return 0, err
    }
    n, err := writer.Write(p)
    if err == nil {
        err = writer.Close()
    }
    return n, err
}
```

When the inner `writer.Write(p)` returns an error, the per-message writer is not closed. Gorilla's `beginMessage` then refuses the next `NextWriter` call with `"websocket: concurrent write to websocket connection"` until the connection is torn down. `stdcopy.StdCopy` stops on the first write error anyway, so this is benign today. But if the method is ever reused outside stdcopy, the leak turns subsequent writes into a cryptic error.

Fix: always close the writer, and surface the first non-nil error.

```diff
 	n, err := writer.Write(p)
-	if err == nil {
-		err = writer.Close()
-	}
-	return n, err
+	if closeErr := writer.Close(); err == nil {
+		err = closeErr
+	}
+	return n, err
```

`(suggestion, regression)`

**S3. `HandleLogs` doc comment does not list the `follow` parameter** — `pkg/controllers/app/engine/controllers/container_passthrough.go:58-59`

```go
// HandleLogs implements the log endpoint to pass through container logs.
// It is at `/passthrough/engine/logs/{container}[?tail=999]`.
func (r *EngineReconciler) HandleLogs(w http.ResponseWriter, req *http.Request) {
```

`follow` is now accepted (line 89-97) but not mentioned in the doc. `docs/design/api_containers.md:273` lists both parameters. Small drift, but since this is the only place a Go caller would look.

Fix: extend the comment to list both query parameters or point at the design doc.

`(suggestion, regression)`

**S4. `GetPassthroughEndpoints` is sorted here but unsorted in the mock controller** — `pkg/controllers/app/engine/controller.go:73-75`

```go
func (c *controller) GetPassthroughEndpoints() []string {
    return slices.Sorted(maps.Keys(c.passthroughs))
}
```

`pkg/controllers/mock/mock_controller.go:81` and `pkg/controllers/app/demo/controller.go:77` use `slices.Collect`, producing random order. The interface doc at `pkg/controllers/base/controller.go:64-70` does not require order, so nothing breaks, but discovery ConfigMap payloads and log lines now flicker for controllers with more than one endpoint while the engine is stable. Pick one convention and apply it consistently.

Fix: change mock and demo to `slices.Sorted` (stable output helps `rdd ctl get configmap/rdd-controller-manager` diffs), or document explicitly that order is undefined.

`(suggestion, gap)`

**S5. "Docker watcher not running" is misleading on the containerd backend** — `pkg/controllers/app/engine/controllers/container_passthrough.go:105-108`

```go
if watcher == nil {
    log.V(5).Info("Docker watcher not running")
    http.Error(w, "Docker watcher not running", http.StatusServiceUnavailable)
    return
}
```

`watcher == nil` is the branch both when the App is stopped AND when `App.spec.containerEngine.name == "containerd"` (see `engine_reconciler.go:197-199`: `wantWatcher = running && engineIsDocker`). A GUI user on the containerd backend will see a misleading transient "not running" message. Given containerd support is explicitly planned (comment on `engineMoby` at line 51-54), the error should be specific.

Fix: distinguish the two cases by also reading `App.Spec.ContainerEngine.Name` (or by stashing a small "reason" field when the reconciler tears the watcher down) and return e.g. `"log passthrough is not supported on the containerd backend"` with a more appropriate status (`501 Not Implemented` is reasonable).

`(suggestion, gap)`

**S6. New BATS tests do not wait for the Container mirror before requesting logs** — `bats/tests/32-app-controllers/engine-docker.bats:185-188`

```bash
run_e -0 docker run --detach --tty --name test-logs-tty busybox /bin/sh -c "${command}"
container=${output}

do_websocket "/passthrough/engine/logs/${container}"
```

Every other new test uses `rdd ctl wait --for=create container/${cid}` before touching the mirror (see the pre-existing "docker run creates Container resource" test at line 140-141). The logs tests skip this. On a slow VM the Docker event-to-mirror path can trail the Docker API by enough to make `r.Client.Get` return NotFound, producing a spurious 404 and a failed assertion that will surface as a flake. The `sleep 1` inside the busybox command doesn't help because the test isn't waiting on that sleep before it issues the request.

Fix: add `rdd ctl wait --for=create --namespace="${RDD_NAMESPACE}" container/"${container}" --timeout=30s` before each `do_websocket` call.

`(suggestion, gap)`

**S7. No test for invalid `follow` query parameter** — `pkg/controllers/app/engine/controllers/container_passthrough.go:89-97`

```go
if f := req.URL.Query().Get("follow"); f != "" {
    if follow, err := strconv.ParseBool(f); err == nil {
        opts = append(opts, engineLogWithFollow(follow))
    } else {
        log.V(5).Info("Invalid follow parameter", "follow", f, "error", err)
        http.Error(w, "Invalid follow parameter", http.StatusBadRequest)
        return
    }
}
```

The 400 path is unreachable from the current BATS suite. Similarly, there is no test for invalid (non-hex) container IDs or hex IDs the mirror does not know about — the mock tests cover the latter pattern (see `containers-mock.bats:76-79` uses `run -22`), but the engine tests skip negative paths entirely. One curl assertion each would catch regressions cheaply.

`(suggestion, gap)`

**S8. `conn.SetReadLimit(4096)` should be a named constant** — `pkg/controllers/app/engine/controllers/container_passthrough.go:156`

```go
defer conn.Close()
conn.SetReadLimit(4096)
```

The other five timing-related literals in the file are already lifted to named constants at the top of the file. This magic number should be too, both for consistency and because the rationale ("we don't expect the client to send much; 4096 is plenty of headroom for PONGs and any future protocol messages") isn't obvious.

`(suggestion, regression)`

## Design Observations

### Strengths

- Deferring the WebSocket upgrade until after all upstream lookups succeed (comment at `container_passthrough.go:147-148`) is the right call: it lets the handler return real HTTP statuses for the common failure modes instead of failing after the 101 handshake. The two inspection calls pre-upgrade (K8s `Get`, Docker `ContainerInspect`) are a small cost for much clearer error surfaces.
- The late `GetPassthroughHandler` binding in `manager.go:479-486` correctly handles the case where `c.reconciler` is nil during the window between passthrough-server startup and `RegisterWithManager`. The 503 response is the right default.

### Concerns

- **`engine` controller is not blank-imported by `cmd/app-controller/main.go` or `cmd/containers-controller/main.go`.** Per the repo deep-review steering ("both `cmd/*/main.go` and the embedded service package blank-import the controller"), the engine controller should be registered in every main that might serve the `containers` API group. Only `pkg/service/cmd/service.go:59` blank-imports it today. This is a pre-existing gap (the engine controller was added earlier), but it becomes more visible with this PR: running the engine reconciler as an out-of-tree process would silently drop its passthrough endpoint with no signal to the operator. Worth fixing now rather than later. `(future)`

## Testing Assessment

Integration coverage is reasonable for the happy paths. Untested scenarios, ranked by risk:

1. **Race between `docker run` and logs request on slow VMs** (S6) — the most likely flake source, and the one most visible in CI.
2. **Invalid `follow` query parameter** (S7) — 400 branch is unreachable from tests.
3. **Non-hex / nonexistent container ID** (S7) — the engine controller's 400/404 paths are untested.
4. **Log request while the watcher is stopped** — the 503 branch (currently mislabelled; see S5) has no coverage.
5. **Log request on the containerd backend** — same 503 branch, plus a policy question about whether the endpoint should be registered at all for backends that cannot serve it.
6. **Ping/pong keepalive path** — current tests complete in under a minute so `pingInterval = 10s` / `pingTimeout = 60s` never exercise a timeout. Not a regression, but the ping logic is the most complex new code and has no coverage.

No unit tests accompany the new package; given the BATS harness covers the end-to-end path, unit tests aren't strictly required, but a small table-driven test around query-parameter parsing would cover items 2 and 3 more cheaply than BATS.

## Documentation Assessment

- `docs/design/api_containers.md:262-273` is updated and consistent with the implementation.
- `container_passthrough.go:58-59` `HandleLogs` doc is incomplete (missing `follow`, see S3).
- `engine_reconciler.go:126-127` `watcherMu` comment is less informative than before (see S1/S1-partial).
- `engine_reconciler.go:113-115` struct doc is now stale ("Only Reconcile and the manager's shutdown-hook goroutine contend for the fields below" — HandleLogs is a new contender).
- Minor prose glitch in the API doc at line 264: `messages are one way, as stream of bytes` reads as a typo for `messages are one way, as a stream of bytes`. Similarly line 277 `unbuffered binary as in the logs endpoint` is ungrammatical. Nice to polish given the writing-clearly guidance in this repo.

## Commit Structure

Two commits, cleanly split: first commit introduces the minimal implementation, second addresses prior review feedback. Commit messages accurately describe the changes. The second commit's enumerated list is useful. No concerns.

## Acknowledged Limitations

- **`// TODO: Figure out containerd`** existed in the first commit (`container_passthrough.go`, commit 78e9315) and has been removed in the second commit, replaced with the engine-interface abstraction (`hasTTY`, `getLogs`). The containerd implementation is still pending per the engine-interface-watcher roadmap tracked in the author's working notes — the endpoint returning 503 on containerd today is consistent with that staging, although the error message does not communicate it (see S5).
- **`ContainerLogs` API contract quirk** (raised by @mook-as in the PR thread): the result type does not expose `Content-Type` / `MediaType`, so TTY cannot be inferred from the stream itself. The author's in-thread resolution (use `ContainerInspect` first, branch on `Config.Tty`) is exactly what the diff does at `docker_watcher.go:463-470`.

## Unresolved Feedback

None. The three PR-thread comments are either resolved by the second commit or are acknowledged limitations captured above.

## Coverage Summary

| File | Outcome |
| --- | --- |
| `bats/tests/32-app-controllers/engine-docker.bats` | S6, S7 |
| `docs/design/api_containers.md` | Reviewed, minor prose noted under Documentation Assessment |
| `pkg/controllers/app/engine/controller.go` | S4 |
| `pkg/controllers/app/engine/controllers/container_passthrough.go` | I1, S2, S3, S5, S8 |
| `pkg/controllers/app/engine/controllers/docker_watcher.go` | Reviewed, no issues |
| `pkg/controllers/app/engine/controllers/engine_reconciler.go` | I1, S1 |
| `pkg/controllers/mock/logs_passthrough.go` | Trivial (TextMessage → BinaryMessage to match engine) |
| `pkg/service/controllers/manager.go` | Reviewed, no issues (Go 1.25 per-iteration loop variables make the closure capture of `endpoint`/`httpController` correct) |
---

### Codex GPT 5.4

### Executive Summary

This PR adds Docker-only log passthrough over WebSocket and threads the new endpoint through the engine controller, manager, and BATS coverage. I did not find a merge-blocking correctness issue in the implementation itself; the remaining issues are both API-contract mismatches introduced by the new docs/validation surface.

### Findings

**Critical Issues**

None.

**Important Issues**

None.

**Suggestions**

S1. **Docs promise UTF-8 messages, but the endpoint streams raw binary chunks** — `docs/design/api_containers.md:263-266` (suggestion, regression)

```md
An endpoint at `/passthrough/.../logs/${container}` will speak WebSocket;
messages are one way, as stream of bytes; messages should not be buffered.
Message text must be UTF-8 encoded.  The last portion of the path must be the
full container ID.
```

The new contract at `api_containers.md` lines 263-266 says this is a byte stream, but it also says each message text is UTF-8. The implementation does not preserve text boundaries: `websocketWriter.Write()` at `container_passthrough.go` lines 45-55 emits one `websocket.BinaryMessage` per raw `Write`, and the copy path at lines 224-227 feeds it directly from `io.Copy` / `stdcopy.StdCopy`. Those writers can split anywhere, so a frame can end mid-rune or contain arbitrary non-text bytes from the container log stream. A client that follows the new doc literally will build the wrong framing assumptions.

Fix: either document this as an opaque binary byte stream with no frame-to-line or frame-to-UTF-8 guarantee, or add buffering that emits only validated text units.

```diff
-An endpoint at `/passthrough/.../logs/${container}` will speak WebSocket;
-messages are one way, as stream of bytes; messages should not be buffered.
-Message text must be UTF-8 encoded.  The last portion of the path must be the
-full container ID.
+An endpoint at `/passthrough/.../logs/${container}` will speak WebSocket.
+Messages are one way and binary; frame boundaries are transport artifacts and
+do not correspond to log lines or UTF-8 rune boundaries. The last portion of
+the path must be the full container ID.
```

S2. **“Full container ID” is documented, but the handler accepts any hex prefix** — `pkg/controllers/app/engine/controllers/container_passthrough.go:37-38` (suggestion, regression)

```go
var containerIDValidator = regexp.MustCompile(`^[0-9a-fA-F]+$`)

if !containerIDValidator.MatchString(containerID) {
	http.Error(w, "Invalid container ID", http.StatusBadRequest)
	return
}
```

The validator at `container_passthrough.go` lines 37 and 63-66 only checks for hex characters, while the new design text at `api_containers.md` lines 265-266 says the path “must be the full container ID.” Requests like `/passthrough/engine/logs/deadbeef` therefore pass validation, hit the mirror lookup at lines 71-74, and come back as `404 Container not found` instead of a validation `400`. That makes client bugs harder to diagnose and leaves the API contract weaker than the documentation claims.

Fix: either enforce the canonical full Docker ID length here, or relax the documentation to say that prefixes are accepted but only full mirrored IDs will resolve.

```diff
-var containerIDValidator = regexp.MustCompile(`^[0-9a-fA-F]+$`)
+var containerIDValidator = regexp.MustCompile(`^[0-9a-fA-F]{64}$`)
```

### Design Observations

**Strengths**

- `HandleLogs()` defers the WebSocket upgrade until after the K8s lookup, TTY inspect, and `ContainerLogs` open. That keeps bad parameters and missing containers on normal HTTP status codes instead of forcing clients to decode WebSocket close frames.
- Inspecting `Config.Tty` before decoding the log stream matches Docker’s own CLI strategy and is the right workaround for `ContainerLogs` not exposing the response media type.

### Testing Assessment

Focused verification passed: `make build-rdd` and `go test ./pkg/controllers/app/engine/... ./pkg/service/controllers ./pkg/controllers/mock/...`.

1. `bats/tests/32-app-controllers/engine-docker.bats:175-292` covers only success paths for the new endpoint. Unlike the mock passthrough suite at `bats/tests/34-containers-controllers/containers-mock.bats:75-79`, it does not exercise invalid IDs or missing containers, so the new `400`/`404` mapping can regress silently.
2. No test drives stderr output through the non-TTY path. The demux branch at `container_passthrough.go:226-227` is exercised indirectly, but there is no assertion that stderr is preserved rather than dropped or misframed.
3. No test covers non-UTF-8 or binary log payloads, which matters as long as the design doc keeps describing message text semantics.

### Documentation Assessment

`docs/design/api_containers.md:263-266` needs to be aligned with the actual wire format. Right now it mixes “stream of bytes” semantics with per-message UTF-8 guarantees, and it overstates path validation by requiring a full container ID that the handler does not enforce.

### Acknowledged Limitations

- **PR comment**: @mook-as noted that `Client.ContainerLogs` does not expose the response media type, so the handler cannot detect TTY-vs-multiplexed output from the log call itself; `ContainerAttach` was rejected because it does not provide historical logs. The current `ContainerInspect`-first implementation is consistent with that rationale.
- **PR comment**: @jandubois recommended inspecting the container first, branching on `Config.Tty`, and returning `404` before upgrading the WebSocket. The current `HandleLogs()` / `hasTTY()` flow is consistent with that guidance.

### Coverage Summary

- `bats/tests/32-app-controllers/engine-docker.bats` — Reviewed, no issues
- `docs/design/api_containers.md` — S1
- `pkg/controllers/app/engine/controller.go` — Reviewed, no issues
- `pkg/controllers/app/engine/controllers/container_passthrough.go` — S1, S2
- `pkg/controllers/app/engine/controllers/docker_watcher.go` — Reviewed, no issues
- `pkg/controllers/app/engine/controllers/engine_reconciler.go` — Reviewed, no issues
- `pkg/controllers/mock/logs_passthrough.go` — Trivial
- `pkg/service/controllers/manager.go` — Reviewed, no issues

---

### Gemini 3.1 Pro

An exceptionally clean and robust PR. The implementation properly addresses the complexities of WebSocket lifecycle management, keep-alives, and TTY multiplexing. The author perfectly captured the recommended `ContainerInspect` pattern to avoid mid-stream WebSocket failures when the container backend state drifts from the Kubernetes mirror state.

### Critical Issues
None.

### Important Issues
**I1. Pre-existing readiness signal fires prematurely and swallows discovery errors** — `pkg/service/controllers/manager.go:194-203` (important, gap)

```go
	// Register controller manager in cluster for service discovery before
	// running the controllers.
	if err := scm.registerDiscovery(ctx); err != nil {
		log.Error(err, "Failed to register controller manager for discovery")
		// Don't fail startup for discovery registration errors
	}

	// Mark the control plane ready after registerDiscovery so clients
	// waiting on the ready annotation see both CRDs installed and
	// controller registration written, not just CRDs.
	if err := MarkControlPlaneReady(ctx, scm.discovery.client); err != nil {
```

Per the documented constraints in the repository instructions, `MarkControlPlaneReady` advertises that every enabled controller is reconciling and webhooks are listening. However, this path fires the annotation immediately, *before* `mgr.Start()` actually starts the controllers and webhook listeners. Furthermore, it explicitly swallows a `registerDiscovery` failure and proceeds to mark the control plane ready anyway, advertising readiness over a partially populated or missing ConfigMap.

**Fix:** Move the `MarkControlPlaneReady` call into a `manager.Runnable` (or trigger it post-start), and only fire it if the discovery registration succeeded. 

### Suggestions
**S1. containerID validation allows uppercase hex but K8s Kinds require lowercase** — `pkg/controllers/app/engine/controllers/container_passthrough.go:37-38` (suggestion, gap)

```go
var containerIDValidator = regexp.MustCompile(`^[0-9a-fA-F]+$`)
```

The validated `containerID` string is immediately passed directly as the `Name` field in a Kubernetes `client.Get()` lookup. Kubernetes resource names must be valid DNS subdomains (lowercase alphanumeric and `-`). While valid Docker container IDs naturally conform (they are lowercase hex strings), the validator explicitly permits uppercase `A-F`. If an uppercase ID is submitted, the API server will reject the object name, falling into the `default` error block and resulting in a 500 Internal Server Error rather than a clean 400 Bad Request.

**Fix:** Restrict the regex to strictly lowercase hex characters to match K8s validation:
```diff
-var containerIDValidator = regexp.MustCompile(`^[0-9a-fA-F]+$`)
+var containerIDValidator = regexp.MustCompile(`^[0-9a-f]+$`)
```

### Design Observations

**Strengths**
- **Graceful WebSocket Termination**: The LIFO `defer` stack in `HandleLogs` correctly manages Gorilla WebSocket `CloseMessage` sending. It gracefully coordinates between client-initiated closures and server-initiated closures without cluttering the logs with `websocket.ErrCloseSent` noise.
- **Race-Free TTY Check**: As requested in prior review comments, the PR uses `ContainerInspect` to evaluate `hasTTY` *before* upgrading the connection to WebSocket. This cleanly avoids race conditions and provides a proper 404 response if the K8s mirror exists but the Docker container has already been purged.
- **Concurrency-Safe Timers**: The WebSocket keep-alive ping loop correctly leverages Go 1.23 timer semantics, safely resetting `pingInterval` directly from the `ReadMessage` pong-handler callback while asynchronously receiving from `timer.C`.

### Testing Assessment
1. **Unreachable WebSocket Handler Path**: `pkg/service/controllers/manager.go` contains defensive `if handler == nil` logic for `GetPassthroughHandler()`. In practice, the HTTP server won't handle requests until after `mgr.Start()`, meaning the handler is always instantiated via `RegisterWithManager()`. While safe, this 503 fallback scenario is currently untested.

### Documentation Assessment
No gaps identified. The changes to `docs/design/api_containers.md` accurately match the implementation semantics (`follow=true` default, raw byte streams).

### Coverage Summary
- `bats/tests/32-app-controllers/engine-docker.bats` — Reviewed, no issues
- `docs/design/api_containers.md` — Reviewed, no issues
- `pkg/controllers/app/engine/controller.go` — Reviewed, no issues
- `pkg/controllers/app/engine/controllers/container_passthrough.go` — S1
- `pkg/controllers/app/engine/controllers/docker_watcher.go` — Reviewed, no issues
- `pkg/controllers/app/engine/controllers/engine_reconciler.go` — Reviewed, no issues
- `pkg/controllers/mock/logs_passthrough.go` — Trivial
- `pkg/service/controllers/manager.go` — I1
