Deep Review: 20260422-110156-pr-323
| Date | 2026-04-22 11:01 |
| Repo | rancher-sandbox/rancher-desktop-daemon |
| Round | 1 |
| Author | @mook-as |
| PR | #323 — implement log passthrough |
| Commits | 71b57d2 Address (AI) review comments78e9315 implement log passthrough |
| Reviewers | Claude Opus 4.7 (effort: xhigh), Codex GPT 5.4 (effort: high), Gemini 3.1 Pro (effort: default) |
| Verdict | Merge with fixes — one genuine data race on r.apiNamespace is worth addressing, plus a handful of clarity and test-coverage cleanups |
| Wall-clock time | 38 min 3 s |
Executive Summary ¶
PR #323 adds a WebSocket log-passthrough endpoint (/passthrough/engine/logs/{container}) for the Docker backend, wiring it through the engine controller's HandleLogs handler and exposing the engine's shared-mux entry via GetPassthroughEndpoints. The second commit addresses earlier AI-review feedback: binary frame encoding, pre-upgrade TTY detection via ContainerInspect, websocket ping/pong keep-alive, tail/follow query parameters, watcher-mutex access, and late handler binding to tolerate startup ordering.
The implementation is structurally sound and the review consensus is positive. No critical defects surfaced. One important issue: the new HTTP handler reads r.apiNamespace from arbitrary goroutines without synchronization, violating the struct's documented single-reconciler invariant. The rest are suggestions — mainly doc drift (API contract promises UTF-8 text while the wire carries raw binary; "full container ID" claimed but any hex string is accepted), untested negative paths for the new endpoint, and a mirror-creation race in the new BATS tests that will surface as a CI flake.
Structure: 1 important (data race), 11 suggestions (including 2 merged from multiple agents), 1 design concern (blank-import gap), several acknowledged limitations tied to pending containerd work.
Critical Issues ¶
None.
Important Issues ¶
http.Error(w, "Invalid container ID", http.StatusBadRequest)
return
}
log.V(5).Info("Handling logs for container", "containerID", containerID)
var c containersv1alpha1.Container
err := r.Client.Get(req.Context(), types.NamespacedName{
Namespace: r.apiNamespace,
Name: containerID,
}, &c)
if err != nil {
switch {
case apierrors.IsNotFound(err):
log.V(5).Info("Container not found", "container", containerID)
http.Error(w, "Container not found", http.StatusNotFound)
r.apiNamespace is written by Reconcile at pkg/controllers/app/engine/controllers/engine_reconciler.go:165 with no lock, and is now read here from HTTP-handler goroutines with no lock. go test -race will flag it, and a weak-memory-model architecture could produce a torn or empty read. In practice apiNamespace is effectively immutable after the first reconcile (per the field doc at engine_reconciler.go:122-124), so today's worst observable outcome is a briefly-empty-string read that yields a spurious 404 before the first Reconcile has run — but the race is real.
var app appv1alpha1.App
if err := r.Get(ctx, client.ObjectKey{Name: appName}, &app); err != nil {
return ctrl.Result{}, client.IgnoreNotFound(err)
}
r.apiNamespace = app.GetResourceNamespace()
running := meta.IsStatusConditionTrue(app.Status.Conditions, appv1alpha1.AppConditionRunning)
engineIsDocker := app.Spec.ContainerEngine.Name == engineMoby
// Treat a dead watcher as a transient disconnect and fall through.
client.Client
// reconcileChan receives events from the engine watcher goroutine.
reconcileChan chan event.GenericEvent
// apiNamespace mirrors App.spec.namespace (immutable). Reconcile
// populates it before any mirror operation.
apiNamespace string
// watcherMu guards r.watcher; note that this may be held for a long time
// during initialization.
watcherMu sync.Mutex
watcher engine
The struct-level comment at engine_reconciler.go:113-115 claims "Only Reconcile and the manager's shutdown-hook goroutine... contend for the fields below." That invariant is now false — HandleLogs is a third concurrent accessor. The field comment at line 126-127 was simultaneously rewritten from a precise description of what watcherMu guards against to a generic note about hold duration, so the struct's concurrency contract is now less documented than before the change (see S1).
// EngineReconciler watches the App resource for the Running condition and
// manages an engine watcher goroutine that mirrors engine state into K8s.
//
// The App is a cluster-scoped singleton, so controller-runtime runs at
// most one Reconcile at a time. Only Reconcile and the manager's
// shutdown-hook goroutine (see SetupWithManager) contend for the
// fields below.
type EngineReconciler struct {
client.Client
// reconcileChan receives events from the engine watcher goroutine.
reconcileChan chan event.GenericEvent
log.V(5).Info("Invalid argument", "error", err)
http.Error(w, "Invalid argument", http.StatusBadRequest)
default:
log.V(5).Info("Failed to inspect container", "error", err)
http.Error(w, "Failed to inspect container", http.StatusInternalServerError)
}
return
}
reader, err := watcher.getLogs(ctx, &c, opts...)
if err != nil {
switch {
Fix: acquire watcherMu once at entry and read watcher and apiNamespace under it; update the struct-level comment and the watcherMu field comment to name HandleLogs as a concurrent reader. Reordering also lets callers see the more-specific 503 before the 404, which helps diagnosis (see S6).
- var c containersv1alpha1.Container
- err := r.Client.Get(req.Context(), types.NamespacedName{
- Namespace: r.apiNamespace,
- Name: containerID,
- }, &c)
+ r.watcherMu.Lock()
+ watcher := r.watcher
+ apiNamespace := r.apiNamespace
+ r.watcherMu.Unlock()
+ if watcher == nil {
+ log.V(5).Info("Docker watcher not running")
+ http.Error(w, "Docker watcher not running", http.StatusServiceUnavailable)
+ return
+ }
+
+ var c containersv1alpha1.Container
+ err := r.Client.Get(req.Context(), types.NamespacedName{
+ Namespace: apiNamespace,
+ Name: containerID,
+ }, &c)
(important, regression)
Suggestions ¶
// EngineReconciler watches the App resource for the Running condition and
// manages an engine watcher goroutine that mirrors engine state into K8s.
//
// The App is a cluster-scoped singleton, so controller-runtime runs at
// most one Reconcile at a time. Only Reconcile and the manager's
// shutdown-hook goroutine (see SetupWithManager) contend for the
// fields below.
type EngineReconciler struct {
client.Client
// reconcileChan receives events from the engine watcher goroutine.
reconcileChan chan event.GenericEvent
// apiNamespace mirrors App.spec.namespace (immutable). Reconcile
// populates it before any mirror operation.
apiNamespace string
// watcherMu guards r.watcher; note that this may be held for a long time
// during initialization.
watcherMu sync.Mutex
watcher engine
// watcherCtx is the parent context for every engine watcher the
// reconciler starts. A manager.RunnableFunc cancels it on
Pre-PR the field comment read: "watcherMu guards r.watcher against the shutdown-hook goroutine's stopWatcher call. Reconcile runs serially (see struct doc), so the lock has no other role." The rewrite dropped the writer-identification (the shutdown hook) and the serial-Reconcile note in exchange for a hold-time remark that was equally true of the old code. After this PR, HandleLogs is a third concurrent reader, so the comment should grow more specific, not less.
Fix: enumerate the three concurrent participants (Reconcile, shutdown-hook goroutine, HandleLogs) and state which fields each one touches. Update the struct doc at lines 113-115 in the same pass.
(suggestion, regression)
The GUI is the intended caller for these actions. CLI users should
reach for `docker start`, `docker stop`, etc. instead: the engine
mirrors Docker state back into `status.status` either way.
#### Fetch container logs
An endpoint at `/passthrough/.../logs/${container}` will speak WebSocket;
messages are one way, as stream of bytes; messages should not be buffered.
Message text must be UTF-8 encoded. The last portion of the path must be the
full container ID.
The following query parameters are accepted:
Parameter | Description | Default
--- | --- | ---
The doc mixes two incompatible contracts: "stream of bytes" plus per-message UTF-8. The implementation provides only the former. websocketWriter.Write at pkg/controllers/app/engine/controllers/container_passthrough.go:45-56 emits one websocket.BinaryMessage per raw Write, fed by io.Copy or stdcopy.StdCopy at lines 224-227. Those writers can split anywhere, so a frame can end mid-rune or contain arbitrary non-text bytes from the container stream. A client that codes to "Message text must be UTF-8 encoded" will build wrong framing assumptions.
// with each Write call resulting in a separate [websocket.BinaryMessage].
type websocketWriter struct {
conn *websocket.Conn
}
func (w *websocketWriter) Write(p []byte) (int, error) {
_ = w.conn.SetWriteDeadline(time.Now().Add(writeTimeout))
writer, err := w.conn.NextWriter(websocket.BinaryMessage)
if err != nil {
return 0, err
}
n, err := writer.Write(p)
if err == nil {
err = writer.Close()
}
return n, err
}
// HandleLogs implements the log endpoint to pass through container logs.
// It is at `/passthrough/engine/logs/{container}[?tail=999]`.
func (r *EngineReconciler) HandleLogs(w http.ResponseWriter, req *http.Request) {
log := ctrl.LoggerFrom(req.Context())
`ContainerRequest` objects at the same time for the same containerd
`name`/`namespace` pair.
#### Change container state
Set the `containers.rancherdesktop.io/action` annotation on the
`Container` to request a one-shot action. The engine reconciler reads
the annotation, dispatches the matching Docker call, records the
outcome in `status.lastAction`, and removes the annotation.
Valid values: `start`, `stop`, `pause`, `unpause`, `restart`.
A single annotation holds at most one pending action. Writing a new
value replaces the old, so a user who requests `pause` and then
Fix: document this as an opaque binary stream with no frame-to-line or frame-to-UTF-8 guarantee. The exec endpoint (line 276-278) references "unbuffered binary as in the logs endpoint", so align the wording there once logs is fixed. The sentence also reads ungrammatically — "as stream of bytes" wants an article — worth polishing while the doc is open.
-An endpoint at `/passthrough/.../logs/${container}` will speak WebSocket;
-messages are one way, as stream of bytes; messages should not be buffered.
-Message text must be UTF-8 encoded. The last portion of the path must be the
-full container ID.
+An endpoint at `/passthrough/.../logs/${container}` will speak WebSocket.
+Messages are one way and binary; frame boundaries are transport artifacts and
+do not correspond to log lines or UTF-8 rune boundaries. The last portion of
+the path must be the full container ID.
(suggestion, regression)
writeTimeout = 10 * time.Second
pingInterval = 10 * time.Second
pingTimeout = time.Minute
)
var containerIDValidator = regexp.MustCompile(`^[0-9a-fA-F]+$`)
// websocketWriter implements [io.Writer] writing to a websocket connection,
// with each Write call resulting in a separate [websocket.BinaryMessage].
type websocketWriter struct {
conn *websocket.Conn
}
Two independent observations converge on the same regex:
The API doc at docs/design/api_containers.md:265-266 says "the full container ID" is required, but the validator accepts any hex prefix. /passthrough/engine/logs/deadbeef passes the 400 gate and hits the mirror lookup at lines 71-74, returning 404 and making client bugs harder to diagnose.
mirrors Docker state back into `status.status` either way.
#### Fetch container logs
An endpoint at `/passthrough/.../logs/${container}` will speak WebSocket;
messages are one way, as stream of bytes; messages should not be buffered.
Message text must be UTF-8 encoded. The last portion of the path must be the
full container ID.
The following query parameters are accepted:
Parameter | Description | Default
--- | --- | ---
return
}
log.V(5).Info("Handling logs for container", "containerID", containerID)
var c containersv1alpha1.Container
err := r.Client.Get(req.Context(), types.NamespacedName{
Namespace: r.apiNamespace,
Name: containerID,
}, &c)
if err != nil {
switch {
case apierrors.IsNotFound(err):
log.V(5).Info("Container not found", "container", containerID)
http.Error(w, "Container not found", http.StatusNotFound)
The validator also accepts uppercase A–F. A Kubernetes object name must be a DNS-1123 subdomain (lowercase only), so r.Client.Get rejects DEADBEEF... as Invalid. That error is neither IsNotFound nor IsInvalidArgument as the handler understands them, so it falls to the default branch and returns 500 instead of 400.
Fix: tighten the regex to the canonical Docker ID shape, which is 64 lowercase hex characters:
-var containerIDValidator = regexp.MustCompile(`^[0-9a-fA-F]+$`)
+var containerIDValidator = regexp.MustCompile(`^[0-9a-f]{64}$`)
Or relax the doc to say prefixes are accepted but only full mirrored IDs resolve.
(suggestion, regression)
// with each Write call resulting in a separate [websocket.BinaryMessage].
type websocketWriter struct {
conn *websocket.Conn
}
func (w *websocketWriter) Write(p []byte) (int, error) {
_ = w.conn.SetWriteDeadline(time.Now().Add(writeTimeout))
writer, err := w.conn.NextWriter(websocket.BinaryMessage)
if err != nil {
return 0, err
}
n, err := writer.Write(p)
if err == nil {
err = writer.Close()
}
return n, err
}
// HandleLogs implements the log endpoint to pass through container logs.
// It is at `/passthrough/engine/logs/{container}[?tail=999]`.
func (r *EngineReconciler) HandleLogs(w http.ResponseWriter, req *http.Request) {
log := ctrl.LoggerFrom(req.Context())
When writer.Write(p) errors, the per-message writer is not closed. Gorilla's beginMessage then rejects the next NextWriter call with "websocket: concurrent write to websocket connection" until the connection is torn down. Today this is benign: stdcopy.StdCopy and io.Copy both stop on the first write error, so no next call happens. If the writer is reused outside stdcopy (or a future caller retries), the leak turns subsequent writes into a cryptic error.
Fix: always close the writer and surface the first non-nil error.
n, err := writer.Write(p)
- if err == nil {
- err = writer.Close()
- }
- return n, err
+ if closeErr := writer.Close(); err == nil {
+ err = closeErr
+ }
+ return n, err
(suggestion, regression)
err = writer.Close()
}
return n, err
}
// HandleLogs implements the log endpoint to pass through container logs.
// It is at `/passthrough/engine/logs/{container}[?tail=999]`.
func (r *EngineReconciler) HandleLogs(w http.ResponseWriter, req *http.Request) {
log := ctrl.LoggerFrom(req.Context())
containerID, _, _ := strings.Cut(strings.TrimLeft(req.URL.Path, "/"), "/")
if !containerIDValidator.MatchString(containerID) {
log.V(5).Info("Invalid container ID", "container", containerID)
http.Error(w, "Invalid container ID", http.StatusBadRequest)
follow is parsed at lines 89-97 but absent from the godoc. docs/design/api_containers.md:273 lists both parameters, so this is the only place a Go caller would see a stale contract.
The following query parameters are accepted:
Parameter | Description | Default
--- | --- | ---
`tail` | Only print the given number of lines (before following). | All
`follow` | Follow the log stream. | `true`
#### Exec (shell) in container
An endpoint at `/passthrough/.../exec` will speak WebSocket; messages are
bidirectional, unbuffered binary as in the logs endpoint. Any text must be
UTF-8 encoded.
return
}
var opts []engineLogOptions
if f := req.URL.Query().Get("follow"); f != "" {
if follow, err := strconv.ParseBool(f); err == nil {
opts = append(opts, engineLogWithFollow(follow))
} else {
log.V(5).Info("Invalid follow parameter", "follow", f, "error", err)
http.Error(w, "Invalid follow parameter", http.StatusBadRequest)
return
}
}
if tail := req.URL.Query().Get("tail"); tail != "" {
opts = append(opts, engineLogWithTail(tail))
}
r.watcherMu.Lock()
Fix: extend the doc to list both parameters, or point at the design doc.
(suggestion, regression)
func (c *controller) GetCRDData() string {
return ""
}
func (c *controller) GetPassthroughEndpoints() []string {
return slices.Sorted(maps.Keys(c.passthroughs))
}
func (c *controller) GetPassthroughHandler(endpoint string) http.Handler {
if c.reconciler == nil {
return nil
}
The engine returns a sorted slice; pkg/controllers/mock/mock_controller.go:81 and pkg/controllers/app/demo/controller.go:77 return slices.Collect output with no ordering. The interface doc at pkg/controllers/base/controller.go:64-70 leaves order undefined, so nothing breaks, but discovery ConfigMap payloads and log lines flicker for controllers with multiple endpoints. Pick one convention and apply it everywhere.
func (c *controller) GetCRDData() string {
return ""
}
func (c *controller) GetPassthroughEndpoints() []string {
return slices.Collect(maps.Keys(c.passthroughs))
}
func (c *controller) GetPassthroughHandler(endpoint string) http.Handler {
return c.passthroughs[endpoint]
}
func (c *controller) GetCRDData() string {
return demoCRD
}
func (c *controller) GetPassthroughEndpoints() []string {
return slices.Collect(maps.Keys(c.passthroughs))
}
func (c *controller) GetPassthroughHandler(endpoint string) http.Handler {
return c.passthroughs[endpoint]
}
//
// This is intended for controllers that need to provide more complex HTTP
// functionality that cannot be provided using custom resources, such as
// handling WebSocket connections to stream data.
type PassthroughController interface {
// GetPassthroughEndpoints returns the list of HTTP endpoints provided by
// this controller. These endpoints will be exposed on the API server under
// the `/passthrough/<controller>/<endpoint>/` path; for example, an
// endpoint named "logs" on the "example" controller will be accessible at
// `/passthrough/example/logs/`. The endpoint names returned must not
// contain duplicates, and must be a valid path segment (i.e., no slashes).
GetPassthroughEndpoints() []string
// GetPassthroughHandler returns the HTTP handler for the given endpoint.
// The handler does not get the `/passthrough/<controller>/<endpoint>`
// prefix; for example, for an endpoint "logs" on the "example" controller,
// if the client requests `/passthrough/example/logs/id` the handler will
Fix: switch mock and demo to slices.Sorted for stable rdd ctl get configmap/rdd-controller-manager diffs, or document the interface as order-undefined and accept flicker.
(suggestion, gap)
}
r.watcherMu.Lock()
watcher := r.watcher
r.watcherMu.Unlock()
if watcher == nil {
log.V(5).Info("Docker watcher not running")
http.Error(w, "Docker watcher not running", http.StatusServiceUnavailable)
return
}
ctx, cancel := context.WithCancel(req.Context())
defer cancel()
watcher == nil covers two very different conditions: App stopped, and App.spec.containerEngine.name == "containerd" (see engine_reconciler.go:197 where wantWatcher = running && engineIsDocker). A GUI user on the containerd backend sees a transient and misleading "not running" message. Containerd support is explicitly planned (comment on engineMoby at engine_reconciler.go:51-54).
// mirrorFinalizer is added to mirror resources so user deletions
// are forwarded to the container engine before the resource is removed.
mirrorFinalizer = "engine.rancherdesktop.io/mirror"
// engineMoby is the App.spec.containerEngine.name value that selects
// the Docker backend. Containerd has no watcher yet and reports
// NotApplicable.
engineMoby = "moby"
)
// engineLogOptionsData holds the options for fetching container logs.
type engineLogOptionsData struct {
tail string
// backend. Any other state stops the watcher and sweeps mirror
// resources. Skip the sweep once ContainerEngineReady already
// reflects the terminal state, to avoid four empty List calls per
// unrelated reconcile; a failed sweep leaves the condition pending
// and the next requeue retries.
wantWatcher := running && engineIsDocker
if !wantWatcher {
if watcherRunning {
log.Info("Stopping Docker watcher",
"running", running, "engine", app.Spec.ContainerEngine.Name)
r.stopWatcher()
Fix: distinguish the two cases. Read App.Spec.ContainerEngine.Name inline, or stash a small reason when the reconciler tears the watcher down, and return e.g. "log passthrough is not supported on the containerd backend" with 501 Not Implemented.
(suggestion, gap)
command='
echo "hello"
sleep 1
echo "world"
'
run_e -0 docker run --detach --tty --name test-logs-tty busybox /bin/sh -c "${command}"
container=${output}
do_websocket "/passthrough/engine/logs/${container}"
# `--tty` means the output is cooked, so LF has been converted to CRLF.
assert_line hello$'\r'
assert_line world$'\r'
}
Every other new test waits for the mirror with rdd ctl wait --for=create container/${cid} before touching it (see "docker run creates Container resource" at line 140-143). The three logs tests skip that wait. On a slow VM the Docker-event-to-mirror path can trail the Docker API by long enough for r.Client.Get to return NotFound, producing a 404 and a failed assertion. The sleep 1 inside the busybox command does not help because the test issues the request before the sleep runs.
@test "docker run creates Container resource with status=running" {
run_e -0 docker run -d --name test-lifecycle busybox sleep inf
cid=${output}
rdd ctl wait --for=create --namespace="${RDD_NAMESPACE}" \
container/"${cid}" --timeout=30s
rdd ctl wait --for=jsonpath='{.status.status}'=running \
--namespace="${RDD_NAMESPACE}" container/"${cid}" --timeout=10s
run -0 rdd ctl get container "${cid}" --namespace="${RDD_NAMESPACE}" \
-o jsonpath='{.status.name}'
assert_output "test-lifecycle"
Fix: prefix each do_websocket call with rdd ctl wait --for=create --namespace="${RDD_NAMESPACE}" container/"${container}" --timeout=30s.
(suggestion, gap)
return
}
var opts []engineLogOptions
if f := req.URL.Query().Get("follow"); f != "" {
if follow, err := strconv.ParseBool(f); err == nil {
opts = append(opts, engineLogWithFollow(follow))
} else {
log.V(5).Info("Invalid follow parameter", "follow", f, "error", err)
http.Error(w, "Invalid follow parameter", http.StatusBadRequest)
return
}
}
if tail := req.URL.Query().Get("tail"); tail != "" {
opts = append(opts, engineLogWithTail(tail))
}
r.watcherMu.Lock()
The 400 branches (invalid follow, non-hex container ID) and the 404 branch (unknown container ID) have no test coverage in bats/tests/32-app-controllers/engine-docker.bats. The mock passthrough suite at bats/tests/34-containers-controllers/containers-mock.bats:75-79 already exercises a similar pattern, so one curl assertion each lands cheaply and guards against status-code regressions.
run -0 curl --silent --header "Authorization: Bearer ${token}" --insecure \
"${server_url/http/ws}/passthrough/mock/logs/${container}"
assert_line "Logs for container ${container}"
assert_line "Label: ${label}"$'\t'"${value}"
# Check that invalid containers are rejected.
run -22 curl --silent --header "Authorization: Bearer ${token}" --insecure \
"${server_url/http/ws}/passthrough/mock/logs/invalid-container-id"
run -22 curl --silent --header "Authorization: Bearer ${token}" --insecure \
"${server_url/http/ws}/passthrough/mock/logs/00000000"
}
@test "images are created" {
rdd ctl wait --for=create namespace rdd-mocks --timeout=30s
Fix: add three assertions — invalid follow=maybe, non-hex container ID, and hex-but-unknown ID.
(suggestion, gap)
if err != nil {
log.V(5).Info("Failed to upgrade to WebSocket", "error", err)
return
}
defer conn.Close()
conn.SetReadLimit(4096)
defer func() {
message := websocket.FormatCloseMessage(websocket.CloseNormalClosure, "Closing connection")
err := conn.WriteControl(websocket.CloseMessage, message, time.Now().Add(controlTimeout))
if err == nil {
// Allow for websocket to close gracefully
The other five timing-related literals at the top of the file (controlTimeout, writeTimeout, pingInterval, pingTimeout) are named. This one deserves the same treatment, both for consistency and because the rationale — clients are not expected to send anything large; 4096 covers PONGs and any future client-to-server protocol — is not obvious at the call site.
Fix: lift to a named constant alongside the others, with a short comment on the rationale.
(suggestion, regression)
MarkControlPlaneReady fires before controllers start, and silently proceeds on discovery-registration failure Gemini¶
return fmt.Errorf("failed to set up ready check: %w", err)
}
// Register controller manager in cluster for service discovery before
// running the controllers.
if err := scm.registerDiscovery(ctx); err != nil {
log.Error(err, "Failed to register controller manager for discovery")
// Don't fail startup for discovery registration errors
}
// Mark the control plane ready after registerDiscovery so clients
// waiting on the ready annotation see both CRDs installed and
// controller registration written, not just CRDs.
if err := MarkControlPlaneReady(ctx, scm.discovery.client); err != nil {
return fmt.Errorf("failed to mark control plane as ready: %w", err)
}
// Ensure cleanup on shutdown with a timeout to avoid blocking if apiserver is gone
defer func() {
cleanupCtx, cleanupCancel := context.WithTimeout(context.Background(), 5*time.Second)
defer cleanupCancel()
The annotation advertises readiness at line 201, but the passthrough server (added as a manager.Runnable at line 215) does not start listening until mgr.Start(ctx) at line 284. A client that sees the ready annotation and races to connect hits connection-refused on the new passthrough port. Discovery-registration failure is also swallowed, so the annotation may fire over an empty or partial discovery ConfigMap.
log.Error(err, "Failed to unregister controller manager")
}
}()
// Add pass through server
if err := mgr.Add(manager.RunnableFunc(scm.runPassthroughServer)); err != nil {
return fmt.Errorf("failed to add pass through server to manager: %w", err)
}
// Register all controllers before launching webhook goroutines.
// RegisterWithManager calls AddToScheme, which writes to the scheme map.
"controllers", len(scm.registrations),
"metricsPort", scm.metricsPort,
"healthPort", scm.healthPort)
// Start the manager (this blocks until context is cancelled)
if err := mgr.Start(ctx); err != nil {
return fmt.Errorf("failed to start shared controller manager: %w", err)
}
return nil
}
The PR does not modify this block, but the new passthrough endpoint is precisely the new surface that exposes the readiness race — the handler == nil defensive check the PR added at the newly-modified lines 479-486 acknowledges the same ordering concern at a finer grain. Worth addressing together rather than layering more workarounds.
prefix := fmt.Sprintf("/%s/%s", registration.GetName(), endpoint)
log.V(2).Info("Registering passthrough endpoint",
"controller", registration.GetName(), "endpoint", prefix)
mux.Handle(prefix+"/", http.StripPrefix(
prefix,
http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
handler := httpController.GetPassthroughHandler(endpoint)
if handler == nil {
http.Error(w, "handler not ready", http.StatusServiceUnavailable)
return
}
handler.ServeHTTP(w, r)
})))
}
}
}
if !hasPassthroughServers {
Fix: move MarkControlPlaneReady into a post-mgr.Start manager.Runnable, and only fire it when discovery registration succeeded.
(suggestion, gap)
Design Observations ¶
Strengths ¶
- Late WebSocket upgrade —
pkg/controllers/app/engine/controllers/container_passthrough.go:147-154(in-scope) Claude Codex Gemini. All three reviewers called this out: the handler defers the upgrade until after the K8sGet, DockerContainerInspect, andContainerLogsopen succeed. Errors flow back through real HTTP status codes instead of post-handshake WebSocket close frames, which is much easier for clients to handle. - Pre-upgrade TTY detection via
ContainerInspect—pkg/controllers/app/engine/controllers/docker_watcher.go:463-470(in-scope) Codex Gemini.Client.ContainerLogsdoes not expose media type, so inspectingConfig.Ttyfirst is the correct way to pick between raw copy andstdcopy.StdCopy. Matches Docker's own CLI strategy. - Graceful ping/pong keep-alive —
pkg/controllers/app/engine/controllers/container_passthrough.go:176-220(in-scope) Gemini. Correct use of Go 1.23 timer semantics; reset on pong and on successful read; cleanly canceled via the shared context. - Late
GetPassthroughHandlerbinding in the mux —pkg/service/controllers/manager.go:479-486(in-scope) Claude. Moving the handler lookup inside thehttp.HandlerFuncclosure tolerates the startup window whenc.reconcileris nil and returns 503 instead of panicking.
Concerns ¶
- The engine controller is not blank-imported by
cmd/app-controller/main.goorcmd/containers-controller/main.go(future) Claude. Onlypkg/service/cmd/service.go:59blank-importspkg/controllers/app/engine. Per the repo's deep-review steering, "bothcmd/*/main.goand the embedded service package blank-import the controller." Running the engine reconciler as an out-of-tree process would silently drop its passthrough endpoint with no signal to the operator. Pre-existing gap, but more visible now that the passthrough endpoint exists.
Testing Assessment ¶
Integration coverage for the happy paths is reasonable: the new BATS cases exercise TTY and non-TTY log streams, tail, and follow=false. The gaps are concentrated in three areas:
- Mirror-creation race in the new tests (S8) — the most likely source of CI flakes.
- Negative paths untested (S9) — invalid
follow, non-hex container ID, hex-but-unknown ID. All reachable; none asserted. - Ping/pong keep-alive uncovered — tests complete in under a minute, so
pingInterval = 10s/pingTimeout = 60snever fire. Not a regression, but the keep-alive is the most intricate new code. - Stderr demux on the non-TTY path — the
stdcopy.StdCopybranch atcontainer_passthrough.go:227is exercised indirectly; no test asserts stderr preservation. - Log request on the containerd backend — no test for the 503 path (which today carries the misleading message from S7).
make build-rdd and go test ./pkg/controllers/app/engine/... ./pkg/service/controllers ./pkg/controllers/mock/... reported green during review (Codex).
Documentation Assessment ¶
docs/design/api_containers.md:262-273describes the new endpoint and its query parameters, but mixes incompatible framing guarantees (binary stream plus per-message UTF-8, see S2) and promises a "full container ID" the handler does not enforce (see S3).pkg/controllers/app/engine/controllers/container_passthrough.go:58-59— godoc forHandleLogsliststailbut notfollow(see S5).pkg/controllers/app/engine/controllers/engine_reconciler.go:113-127— struct doc andwatcherMufield comment went stale against the new HTTP-handler concurrency (see I1 and S1).- Minor prose: line 264 "as stream of bytes" wants "as a stream of bytes"; line 277 "unbuffered binary as in the logs endpoint" is ungrammatical. Easy polish while the doc is open.
Commit Structure ¶
Two commits, cleanly split: the first introduces the minimal implementation; the second addresses prior review feedback with an enumerated changelog. No concerns.
Acknowledged Limitations ¶
- Containerd backend returns 503 for log requests — containerd mirroring is still on the roadmap (see
engine_reconciler.go:51-54). S7 covers the user-facing message today. Client.ContainerLogsdoes not expose media type — raised by @mook-as in the PR thread. The currentContainerInspect+Config.Ttyapproach is the resolution and matches Docker's own CLI behavior.- TTY detection via inspect is a point-in-time read — the container's TTY configuration is immutable after create, so this is not a race the handler can lose.
Unresolved Feedback ¶
None. The three in-thread PR comments either landed as changes in the second commit or are captured under Acknowledged Limitations.
Agent Performance Retro ¶
[Claude] — The deepest and most actionable pass: surfaced the only important finding (the apiNamespace data race), traced it to the stale struct-level doc and rewritten watcherMu field comment, and proposed a fix that also improves diagnostics by reordering the 404/503 branches. Caught two gap-level issues the other agents missed (the BATS mirror-creation race and the engine controller's missing blank-imports). Its suggestions lean toward clarity and test-coverage cleanups, which fits a PR this tight.
[Codex] — Focused and precise, but narrower than the others: limited to two doc/contract mismatches and nothing on the implementation itself. Both findings landed — UTF-8 framing and the over-permissive container ID regex — and the latter merged cleanly with Gemini's uppercase-hex observation. Ran make build-rdd and the package tests to back its "no merge-blocker in the implementation" verdict, which the consolidation accepted.
[Gemini] — Two useful angles the others missed: the pre-existing MarkControlPlaneReady readiness race that the new passthrough endpoint exposes, and the uppercase-hex case of the container ID regex. The MarkControlPlaneReady finding was labeled important, but the code it points to is not changed in this PR; consolidation downgraded it to a suggestion (S11) framed around the PR's new surface. Hit a sandbox denial trying to write a throwaway test file outside the worktree (/tmp/testtimer.go); recovered without impact.
Summary ¶
| Claude Opus 4.7 | Codex GPT 5.4 | Gemini 3.1 Pro | |
|---|---|---|---|
| Duration | 21m 00s | 5m 38s | 8m 31s |
| Findings | 1I 8S | 2S | 2S |
| Tool calls | 53 (Bash 30, Read 21, Grep 2) | 44 (shell 41, stdin 3) | 32 (runshellcommand 16, grep_search 9, glob 2) |
| Design observations | 2 | 2 | 3 |
| False positives | 0 | 0 | 0 |
| Unique insights | 7 | 1 | 2 |
| Files reviewed | 8 | 8 | 8 |
| Coverage misses | 0 | 0 | 0 |
| Totals | 1I 8S | 2S | 2S |
| Downgraded | 0 | 0 | 1 (I→S) |
| Dropped | 0 | 0 | 0 |
Claude delivered the most value: the only important finding, the most unique insights, and the design-level observations that tied findings together. Codex and Gemini each contributed one useful suggestion that the others missed. The three reviews reinforced each other on the strengths (late upgrade, pre-upgrade TTY detection) without redundantly repeating them.
Reconciliation ¶
Gemini P1 "Pre-existing readiness signal" (important, on unchanged code): important → suggestion S11. Rationale: the code is unchanged by this PR; the PR's new passthrough endpoint is the new surface exposing the pre-existing race, so it belongs as a suggestion tied to the PR rather than a blocker for it.
Review Process Notes ¶
Repo context updates ¶
- Add a note to
deep-review-context.mdabout the engine controller's blank-import gap incmd/*/main.go. Reviewers should check this when a PR adds a new user-visible surface (API endpoint, passthrough handler, webhook) to a controller that is not yet registered in every standalone binary — a missing blank-import silently drops the feature in those binaries with no startup error. Apply when the PR adds a surface whose absence would not surface as a test failure.