Deep Review: 20260417-123813-pr-320
| Date | 2026-04-17 12:38 |
| Repo | rancher-sandbox/rancher-desktop-daemon |
| Round | 2 |
| Author | @Nino-K |
| PR | #320 — Manage Docker CLI context on ContainerEngineReady transitions |
| Branch | docker-context |
| Commits | ee25067 Manage Docker CLI context on ContainerEngineReady transitionsfe0962c Add bats to support the docker context |
| Reviewers | Claude Opus 4.7 (effort: xhigh), Codex GPT 5.4 (effort: high), Gemini 3.1 Pro (effort: default) |
| Verdict | Merge with fixes — round 1's three structural bugs are gone, but the probe still silently steals the default context, an unrelated read-failure path can clobber the user's config, and BATS tests still mutate the developer's real ~/.docker. |
| Wall-clock time | 15 min 24 s |
Executive Summary ¶
Round 2 fixes every structural defect from round 1: the cancel-vs-write race in the probe goroutine is closed by contextProbeWg.Wait() plus a probeCtx.Err() recheck, the hardcoded unix:// scheme is replaced with full-host pass-through, updateDockerConfig now uses an atomic tmp-file + rename with a no-op fast path, the Windows divergence is walled off with runtime.GOOS == "windows" early returns, and every new error site logs through the scoped logf.FromContext(...) logger with .Error. The concurrency and lifecycle wiring (contextMu + contextProbeGen + contextProbeWg) is now solid.
What's left are six issues, none catastrophic but all material:
- Default context is still skipped without a probe. The condition
current != "" && current != "default"jumps straight to "promote ours" — so any user with Docker Engine on Linux pointed at the default/var/run/docker.sock(the most common Linux setup) will silently have theircurrentContextoverwritten the first timeContainerEngineReadyfires. - The probe drops TLS/SSH transport state. The host URL alone is not enough to dial a context with
TLSDataor SSH options; the probe creates a bare client, fails to connect, and silently overrides a working remote/TLS context. - A swallowed
getCurrentDockerContexterror can still clobber the user's config. When the first read fails transiently (EBUSY, antivirus lock on a network home, etc.), the code logs and proceeds withcurrent=""; the second read insideupdateDockerConfigsucceeds, and the user's chosencurrentContextis silently replaced. Reviewer @mook-as raised this and the author agreed but deferred — the fix is onereturn. - BATS tests still operate on the developer's real
~/.docker/config.json. NoHOME/DOCKER_CONFIGisolation; any mid-test failure leaves local Docker state mutated. DOCKER_CONFIGis ignored. Users with a relocated config dir (dev containers, CI, Nix profiles) get a context written under~/.dockerwhile their CLI reads from$DOCKER_CONFIG.- The Windows skip has no
TODO:or filed issue. @mook-as's approval was conditioned on it ("Just needs to have an issue filed to keep track of the Windows thing").
The hand-rolled context store still duplicates docker/cli semantics — each of I1, I2, I5, and the Windows gap flows from a piece of logic that github.com/docker/cli/cli/context/store already gets right. That trade-off was made deliberately, but the unfixed bugs above are all the same shape, so the design observation persists.
Verdict: Merge with fixes. I1 (default), I3 (clobber), I4 (BATS isolation), and I6 (Windows TODO) are merge-blockers; I2 and I5 can ship as follow-ups if the design observation about adopting the docker context store lands.
Critical Issues ¶
None.
Important Issues ¶
current, err := getCurrentDockerContext()
if err != nil {
log.Error(err, "Failed to read current Docker context")
}
// Probe the current context's host — not our own — to decide
// whether to promote ours. "default" and empty both mean no
// working context is set; treat them identically.
var healthy bool
if current != "" && current != "default" {
currentHost, err := getDockerContextHost(current)
if err != nil {
log.Error(err, "Failed to resolve current Docker context host", "context", current)
} else if currentHost != "" {
healthy = probeDockerContext(probeCtx, currentHost)
}
}
// Guard against writing currentContext after removeDockerContext has
// already cancelled probeCtx and deleted the context directory.
if !healthy && probeCtx.Err() == nil {
if err := setCurrentDockerContext(contextName); err != nil {
log.Error(err, "Failed to set current Docker context", "context", contextName)
The comment claims "" and "default" both mean "no working context", but default is a real Docker context: on Linux with Docker Engine installed it points to /var/run/docker.sock, which is typically running. By skipping the probe entirely for these values, the code leaves healthy=false and then unconditionally writes currentContext = "rancher-desktop-<suffix>", silently overriding the user's working local Docker daemon on the most common Linux configuration. Same gap on a freshly-installed Docker Desktop alternative where currentContext was never set.
This was raised as round 1 S2 and is unaddressed; Gemini correctly re-flags it (at Critical, downgraded here to Important because recovery is one docker context use default away).
Fix: probe the default endpoint instead of skipping it. The Docker client's FromEnv resolver picks up DOCKER_HOST and falls back to the platform default Unix socket:
var healthy bool
if current != "" && current != "default" {
currentHost, err := getDockerContextHost(current)
if err != nil {
log.Error(err, "Failed to resolve current Docker context host", "context", current)
} else if currentHost != "" {
healthy = probeDockerContext(probeCtx, currentHost)
}
+} else {
+ // "" or "default" → probe the implicit default endpoint
+ // (DOCKER_HOST or the platform default socket).
+ healthy = probeDockerContext(probeCtx, "")
}
And teach probeDockerContext to use dockerclient.FromEnv when the host is empty:
func probeDockerContext(ctx context.Context, host string) bool {
probeCtx, cancel := context.WithTimeout(ctx, dockerContextProbeTimeout)
defer cancel()
- cli, err := dockerclient.New(dockerclient.WithHost(host))
+ var opts []dockerclient.Opt
+ if host != "" {
+ opts = append(opts, dockerclient.WithHost(host))
+ } else {
+ opts = append(opts, dockerclient.FromEnv)
+ }
+ cli, err := dockerclient.New(opts...)
...
}
}
// getDockerContextHost returns the full Docker host URL (e.g. "unix:///path/to/docker.sock"
// or "tcp://192.168.1.1:2376") for the named context's docker endpoint.
// Returns an empty string if the context does not exist or has no docker endpoint.
func getDockerContextHost(name string) (string, error) {
path, err := contextMetaPath(name)
if err != nil {
return "", err
}
data, err := os.ReadFile(path)
if errors.Is(err, os.ErrNotExist) {
return "", nil
}
if err != nil {
return "", err
}
var meta dockerContextMeta
if err := json.Unmarshal(data, &meta); err != nil {
return "", err
}
return meta.Endpoints["docker"].Host, nil
}
// getCurrentDockerContext reads the currentContext field from ~/.docker/config.json.
// Returns an empty string if the file does not exist or no context is set.
func getCurrentDockerContext() (string, error) {
getDockerContextHost returns only Endpoints["docker"].Host, so any context that depends on extra transport state — TLS certificate paths in ~/.docker/contexts/tls/<hash>/, SSH options, mTLS, identity files — is probed as if it were a bare host URL. The client built with only WithHost cannot complete the TLS handshake or SSH dial, the Ping fails, and the goroutine treats the context as unhealthy and overrides it. This violates the PR's documented "leave healthy contexts alone" contract for any user whose currentContext is a TCP+TLS or SSH endpoint (a common pattern for CI runners and remote-engine setups).
Round 1's I2 fix moved the probe from unix://-prefix-stripping to full host pass-through, which closed the tcp:// URL-construction bug, but the deeper "context resolution requires more than the host URL" issue remained.
Fix: reconstruct the full Docker context client from the context store, not from the host URL alone. The smallest change is to read the TLS material from ~/.docker/contexts/tls/<hash>/docker/{ca,cert,key}.pem when present and pass it via WithTLSClientConfig. The cleaner change is to use github.com/docker/cli/cli/context/store (see Design Observations) so the probe inherits whatever the user configured.
getCurrentDockerContext error can clobber the user's currentContext on a transient read failure Claude¶
}
r.contextMu.Unlock()
cancel()
}()
current, err := getCurrentDockerContext()
if err != nil {
log.Error(err, "Failed to read current Docker context")
}
// Probe the current context's host — not our own — to decide
// whether to promote ours. "default" and empty both mean no
// working context is set; treat them identically.
var healthy bool
if current != "" && current != "default" {
currentHost, err := getDockerContextHost(current)
if err != nil {
log.Error(err, "Failed to resolve current Docker context host", "context", current)
} else if currentHost != "" {
healthy = probeDockerContext(probeCtx, currentHost)
}
}
// Guard against writing currentContext after removeDockerContext has
// already cancelled probeCtx and deleted the context directory.
if !healthy && probeCtx.Err() == nil {
if err := setCurrentDockerContext(contextName); err != nil {
log.Error(err, "Failed to set current Docker context", "context", contextName)
}
}
}()
}
// removeDockerContext cancels any in-flight probe, waits for it to finish,
// then resets the current context if it points at our instance and deletes
When getCurrentDockerContext fails (transient EBUSY/EIO, antivirus lock on a CIFS-mounted $HOME, fs snapshot, etc.), the code logs and proceeds with current="". healthy stays false, and setCurrentDockerContext is called. That call routes through updateDockerConfig, which reads the file again. If the second read succeeds, the user's other keys are preserved but their chosen currentContext is silently replaced with ours. The user picked a specific context deliberately; the next time they run docker ps it talks to our VM instead.
Reviewer @mook-as raised this at discussion_r3102364628 and @Nino-K agreed (discussion_r3102730266: "I guess it would make sense to return here, since an error here means the config file exists but could not be read or parsed") but deferred to a Windows follow-up. Fix is one return:
current, err := getCurrentDockerContext()
if err != nil {
- log.Error(err, "Failed to read current Docker context")
+ log.Error(err, "Failed to read current Docker context; skipping promotion to avoid clobbering user config")
+ return
}
The user can still address our context explicitly via DOCKER_CONTEXT=rancher-desktop-<N> docker ...; not promoting on a read failure is strictly safer than promoting from a stale "".
~/.docker/config.json without HOME / DOCKER_CONFIG isolation Codex Gemini¶
# Engine controller tests — verify that the engine controller mirrors Docker
# containers, images, and volumes into Container, Image, and Volume
# resources, and that deletions and spec.state changes are forwarded
# to Docker.
local_setup_file() {
# The Docker socket access pattern used by these tests is not yet wired
# up for Windows/WSL2.
skip_on_windows
# Deliberately skip setup_rdd_control_plane here: `rdd set` internally
# calls ensureServiceRunning, which is exactly the CLI path we want to
# exercise — the engine controller is part of the default controller
# set, so no explicit --controllers selection is needed.
rdd svc delete
rdd set running=true
run -0 rdd svc paths docker_socket
export DOCKER_HOST="unix://${output}"
# Mirror resources live in App.spec.namespace. Override RDD_NAMESPACE
# to whatever the App was created with so the test queries the same
# namespace the engine controller uses, regardless of CRD defaults.
RDD_NAMESPACE=$(rdd ctl get app app -o jsonpath='{.spec.namespace}')
export RDD_NAMESPACE
}
# --- Startup ---
@test "ContainerNamespace moby exists" {
local_setup_file exports DOCKER_HOST but does not isolate HOME or DOCKER_CONFIG. The "save → mutate → restore" inline block in the currentContext test rewrites the developer's real ~/.docker/config.json in place, and the restore only runs when control reaches the bottom of the test — any failed assertion above leaves currentContext cleared on the developer's machine. The base BATS teardown_file only stops rdd. There is no safety net.
This was raised as round 1 S5 and is unaddressed.
Fix (in order of preference):
- Override
DOCKER_CONFIG(orHOME) inlocal_setup_fileto a per-suite temp directory. Once I5 lands, exportingDOCKER_CONFIGis the cross-platform-clean version (Windows usesUSERPROFILE, notHOME). - Move the save/restore into a BATS
teardownso it runs on failure too. Theset -e/set -o pipefailframing insidetrymeans a failed assertion above the inline restore aborts the test body without touching the cleanup.
type dockerEndpointMeta struct {
Host string `json:"Host"`
SkipTLSVerify bool `json:"SkipTLSVerify"`
}
func dockerConfigDir() (string, error) {
home, err := os.UserHomeDir()
if err != nil {
return "", err
}
return filepath.Join(home, ".docker"), nil
}
// contextMetaPath returns the path to the meta.json file for the named context.
// Docker creates the directory name using sha256(contextName).
func contextMetaPath(name string) (string, error) {
dir, err := dockerConfigDir()
The Docker CLI honours $DOCKER_CONFIG to relocate the config directory; it's standard for dev containers, CI runners, Nix profiles, and any user who keeps $HOME clean. With this implementation, the context meta file lands under ~/.docker/contexts/meta/<hash>/ while the user's CLI reads from $DOCKER_CONFIG/contexts/meta/<hash>/, so the new context is invisible. The promote step writes to ~/.docker/config.json for the same reason, and the user's actual config dir is left untouched.
This was raised as round 1 S3 and is unaddressed.
Fix:
func dockerConfigDir() (string, error) {
+ if dir := os.Getenv("DOCKER_CONFIG"); dir != "" {
+ return dir, nil
+ }
home, err := os.UserHomeDir()
if err != nil {
return "", err
}
return filepath.Join(home, ".docker"), nil
}
Side benefit: the unit tests that currently do t.Setenv("HOME", ...) become t.Setenv("DOCKER_CONFIG", ...), which is portable to the Windows test runner where os.UserHomeDir() consults USERPROFILE.
// manageDockerContext creates the instance Docker context and, in a goroutine,
// probes the user's current context; if it is absent or unhealthy, switches
// the default to the new context. At most one probe runs at a time.
func (r *EngineReconciler) manageDockerContext(socketPath string) {
// Docker context management uses unix:// sockets; Windows requires
// npipe:// which is not yet implemented.
if runtime.GOOS == "windows" {
return
}
contextName := instance.Name()
log := logf.FromContext(r.watcherCtx).WithName("docker-context")
if err := createReplaceDockerContext(contextName, socketPath); err != nil {
log.Error(err, "Failed to create Docker context", "context", contextName)
@mook-as approved the PR conditionally (discussion_r3101962597): "Just needs to have an issue filed to keep track of the Windows thing I think." Neither a TODO: marker that grep TODO would surface nor a referenced issue number appears in the code. Without one, the asymmetry is invisible to a future contributor scanning for follow-up work.
The comment also understates the gap: only the URL scheme is called out, but probeDockerContext would also need a different client construction (named-pipe transport), and getCurrentDockerContext / createReplaceDockerContext would need npipe-aware host serialization.
Fix: file a tracking issue and reference it, or add a grep-able marker in both functions:
func (r *EngineReconciler) manageDockerContext(socketPath string) {
- // Docker context management uses unix:// sockets; Windows requires
- // npipe:// which is not yet implemented.
+ // TODO(windows): Docker context management uses unix:// sockets;
+ // Windows requires npipe:// in the meta file, an npipe-aware probe
+ // client, and platform-specific path handling. Tracked in #<N>.
if runtime.GOOS == "windows" {
return
}
Apply the same TODO + issue reference to removeDockerContext.
Suggestions ¶
}
sum := sha256.Sum256([]byte(name))
return filepath.Join(dir, "contexts", "meta", hex.EncodeToString(sum[:]), "meta.json"), nil
}
func createReplaceDockerContext(name, socketPath string) error {
path, err := contextMetaPath(name)
if err != nil {
return err
}
if err := os.MkdirAll(filepath.Dir(path), 0o700); err != nil {
return err
}
meta := dockerContextMeta{
Name: name,
Metadata: map[string]any{},
Endpoints: map[string]dockerEndpointMeta{
"docker": {Host: "unix://" + socketPath},
},
}
data, err := json.MarshalIndent(meta, "", "\t")
if err != nil {
return err
}
return os.WriteFile(path, append(data, '\n'), 0o600)
}
func deleteDockerContext(name string) error {
path, err := contextMetaPath(name)
if err != nil {
return err
updateDockerConfig (lines 207-225) goes to considerable lengths to do a tmp-file + rename for config.json. createReplaceDockerContext does not — os.WriteFile truncates before writing, so a crash mid-write leaves meta.json zero- or partial-length. The next start re-creates the file, so this is self-healing for our use, but a docker context inspect rancher-desktop-<N> from the user during the window returns a parse error.
}
dir := filepath.Dir(path)
if err := os.MkdirAll(dir, 0o700); err != nil {
return err
}
tmp, err := os.CreateTemp(dir, ".config.json.*")
if err != nil {
return err
}
if _, err := tmp.Write(append(out, '\n')); err != nil {
tmp.Close()
os.Remove(tmp.Name())
return err
}
if err := tmp.Chmod(0o600); err != nil {
tmp.Close()
os.Remove(tmp.Name())
return err
}
if err := tmp.Close(); err != nil {
os.Remove(tmp.Name())
return err
}
return os.Rename(tmp.Name(), path)
}
Fix: extract the tmp-file + rename pattern from updateDockerConfig into a writeFileAtomic(path string, data []byte, perm os.FileMode) helper and use it from both call sites. Keeps the atomicity invariant uniform across both files we own under ~/.docker/.
if !wantWatcher {
if watcherRunning {
log.Info("Stopping Docker watcher",
"running", running, "engine", app.Spec.ContainerEngine.Name)
r.stopWatcher()
} else {
// The watcher was never started or died on its own (e.g. the VM
// socket closed). stopWatcher was not called, so clean up the
// Docker context directly.
r.removeDockerContext()
}
terminalReason := "Stopped"
terminalStatus := metav1.ConditionFalse
terminalMessage := "Container engine stopped"
if running && !engineIsDocker {
// Report NotApplicable as Status=True so `rdd set
Every reconcile with !wantWatcher && !watcherRunning (containerd mode, or stopped) now calls removeDockerContext. The no-op fast path in updateDockerConfig (mutate returns false → no write) keeps this cheap, but it still issues a ReadFile on ~/.docker/config.json and an os.RemoveAll on the contexts dir on every App event for a user who has never run moby. Stable terminal state should be I/O-free.
Fix: gate removeDockerContext() behind the same !alreadyClean condition that already guards cleanupMirrorResources:
if !wantWatcher {
if watcherRunning {
log.Info("Stopping Docker watcher", ...)
r.stopWatcher()
- } else {
- // The watcher was never started or died on its own (e.g. the VM
- // socket closed). stopWatcher was not called, so clean up the
- // Docker context directly.
- r.removeDockerContext()
}
...
alreadyClean := !watcherDied && current != nil && current.Reason == terminalReason && current.Status == terminalStatus
if !alreadyClean {
+ if !watcherRunning {
+ r.removeDockerContext()
+ }
if err := r.cleanupMirrorResources(ctx); err != nil {
...
"gotest.tools/v3/assert"
)
// dockerTestPaths holds the config and contexts directory for a test.
type dockerTestPaths struct {
configFile string
contextsDir string
}
// newDockerTestDir creates a temp ~/.docker layout and returns its paths.
func newDockerTestDir(t *testing.T) dockerTestPaths {
t.Helper()
root := t.TempDir()
dockerDir := filepath.Join(root, ".docker")
assert.NilError(t, os.MkdirAll(dockerDir, 0o700))
return dockerTestPaths{
configFile: filepath.Join(dockerDir, "config.json"),
contextsDir: filepath.Join(dockerDir, "contexts"),
}
}
func Test_updateDockerConfig_set(t *testing.T) {
t.Parallel()
t.Run("creates file when absent", func(t *testing.T) {
contextsDir is assigned but never read by any test in the file (round 1 S9 was the same finding; still unaddressed). Drop it:
type dockerTestPaths struct {
configFile string
- contextsDir string
}
@@
return dockerTestPaths{
configFile: filepath.Join(dockerDir, "config.json"),
- contextsDir: filepath.Join(dockerDir, "contexts"),
}
assert_file_exists "${meta_file}"
run -0 jq -r '.Name' "${meta_file}"
assert_output "${context_name}"
run -0 rdd service paths docker_socket
local socket_path=${output}
run -0 jq -r '.Endpoints.docker.Host' "${meta_file}"
assert_output "unix://${socket_path}"
}
Both spellings work (svc is an alias). Line 22 in the same file uses rdd svc paths docker_socket, and the rest of the bats suite uses svc. Keep consistent:
- run -0 rdd service paths docker_socket
+ run -0 rdd svc paths docker_socket
local context_name="rancher-desktop-${RDD_INSTANCE}"
# Save and clear any existing currentContext so the probe finds no
# healthy context and promotes ours. Restored in teardown.
local saved_context
saved_context=$(jq -r '.currentContext // empty' "${HOME}/.docker/config.json" 2>/dev/null || true)
# Clear the current context and restart the engine so the probe runs fresh.
jq 'del(.currentContext)' "${HOME}/.docker/config.json" >"${HOME}/.docker/config.json.tmp" &&
mv "${HOME}/.docker/config.json.tmp" "${HOME}/.docker/config.json"
Repo BATS style forbids $(cmd) for assertion-feeding values: failures inside $(...) produce opaque BATS errors with no output, and the || true makes a real failure indistinguishable from a missing file. Reviewer @mook-as flagged the same line at discussion_r3096228639; unaddressed.
Fix: branch on file existence and capture via run -0:
- local saved_context
- saved_context=$(jq -r '.currentContext // empty' "${HOME}/.docker/config.json" 2>/dev/null || true)
+ local saved_context=""
+ if [[ -f "${HOME}/.docker/config.json" ]]; then
+ run -0 jq -r '.currentContext // empty' "${HOME}/.docker/config.json"
+ saved_context=${output}
+ fi
Once I4 lands, the whole save/restore block disappears (the DOCKER_CONFIG-isolated dir is discarded by the temp-dir cleanup).
Design Observations ¶
Concerns
- Hand-rolled Docker context store still duplicates docker/cli semantics
(in-scope)Codex Claude — Round 1 raised this; round 2 leaves it unchanged. Each of I1 (default-context resolution), I2 (transport reconstruction), I5 (DOCKER_CONFIG), and the Windows gap is a piece of logic thatgithub.com/docker/cli/cli/context/storealready gets right. The author argued in discussion_r3096716690 that the dependency cost is too high for ~30 lines we control, and the trade-off is defensible — but the unfixed bugs above are all the same shape, and each follow-up round is likely to add more of them. Worth a re-evaluation once a decision on TLS-context probing is made: writing a bug-free TLS resolver from scratch is harder than swallowing the dep. - *
ContainerEngineReady=Truefires before the probe completes*(future)Claude — @mook-as noted at discussion_r3096360169 that "ContainerEngineReady shouldn't be set until the the context is set up correctly." Themeta.jsonis written synchronously insidestartWatcherAndSyncbeforeConnectedis asserted, soDOCKER_CONTEXT=rancher-desktop-<N> docker ...works the momentContainerEngineReady=Truelands. Only thecurrentContextpromotion is async — which is unavoidable given a 3s probe timeout. Current behavior is consistent with the design in issue #318: promotion is an ergonomic effect, not a precondition for engine readiness. Worth a comment near the condition update so future readers don't try to "fix" the asynchrony.
Strengths
- Round 1's three structural bugs are cleanly fixed Claude Codex Gemini —
contextProbeWg+probeCtx.Err()recheck closes the cancel-vs-write race; full-host pass-through replaces theunix://strip-and-reapply for non-Unix endpoints;updateDockerConfignow uses tmp-file + rename with amutate(cfg) boolno-op fast path;runtime.GOOS == "windows"early-returns wall off both lifecycle entry points; every new error site logs through the scopedlogf.FromContext(...)logger with.Error. The fixes match the round 1 finding text closely. - Goroutine concurrency design is well-reasoned Claude Codex Gemini — the
contextProbeGen/contextProbeCancel/contextProbeWgtrio handles all the tricky races: superseding probes (gen compare in defer), blocking the cleanup until the probe cannot write, and theprobeCtx.Err()recheck right before the write. The field-level comments atengine_reconciler.go:95-107are genuinely helpful for future readers. - *
updateDockerConfigpreserves unrelated keys and writes atomically* Codex Claude — the right baseline for any mutation of~/.docker/config.json. The no-op fast path means stable terminal state (after I4 lands) costs zero I/O.
Testing Assessment ¶
The new unit tests cover the file-shape helpers in isolation (updateDockerConfig set/clear/no-op, createReplaceDockerContext, getCurrentDockerContext) and the BATS additions exercise create, async promote, and remove. The gaps are the same shape as round 1, concentrated on the probe goroutine and the integration-with-real-CLI paths:
- No test for the
getCurrentDockerContexterror path (I3) — a malformed or unreadableconfig.jsonshould not causesetCurrentDockerContextto silently clobber. Even with the I3 fix (early return), a regression test guarding the behavior is cheap. - No test for the default-context skip (I1) — a
currentContextof""or"default"pointing at a healthy default endpoint should remain selected. No unit or integration test exercises this. - No test for non-
unix://current contexts (I2) — TLS, SSH, and TCP probe behavior is unverified. - No test for the generation race — two back-to-back
manageDockerContextcalls verifying that the second probe wins and the first cannot write post-supersede. The subtlest piece of the concurrency design is currently untested. - No test for
removeDockerContextwith a probe in flight — the whole point ofcontextProbeWg.Wait()is to block delete until the probe finishes. A test that stages a slow probe (fake hung socket) and verifies the cleanup blocks would exercise this directly. - No
DOCKER_CONFIG-isolated test setup (I4) — mutating the developer's real~/.dockeris a structural test-isolation gap, not just a per-test bug. - Windows skip path (I6) — the bats suite has a file-level
skip_on_windows; the production-code Windows no-op is unexercised in any form. A unit test assertingmanageDockerContext(...)is a safe nil onruntime.GOOS == "windows"would close this until the npipe path lands.
Documentation Assessment ¶
docs/design/api_app.md:39anddocs/design/bootstrap.md:58still attribute Docker context creation to the App controller /rdd startflow. This PR places the responsibility in the engine reconciler. The mismatch was raised in round 1 and is unaddressed. Either update the docs or move the code (the engine reconciler is the right home — it ownsContainerEngineReady).- No godoc on
createReplaceDockerContext,deleteDockerContext,setCurrentDockerContext,clearCurrentDockerContext,dockerConfigDir, ordockerContextMeta/dockerEndpointMeta. The rest of the helpers in the same file have one. Add for consistency. manageDockerContext's docstring atengine_reconciler.go:273-275says "At most one probe runs at a time." Mechanically accurate via the generation counter; worth noting that the WaitGroup count can briefly exceed 1 during transitions (acknowledged in PR review), and thatWait()only cares about reaching zero.
Commit Structure ¶
Two commits:
ee25067Manage Docker CLI context on ContainerEngineReady transitions — Go code, unit testsfe0962cAdd bats to support the docker context — integration tests
Clean split, readable in either order; no fixup noise. Round 1 suggested squashing for git bisect cleanliness; round 2 keeps the split, which is also fine — the bats commit alone is a coherent addition. Not blocking.
Acknowledged Limitations ¶
- Windows not implemented —
engine_reconciler.go:277-281, 345-347: both lifecycle functions early-return onruntime.GOOS == "windows". The comment claims only the URL scheme is the gap; in practiceprobeDockerContextand the meta-file serialization would also need npipe-aware paths. See I6 for the missing TODO/issue marker. - Non-atomic
createReplaceDockerContext—docker_context.go:55-75: author consciously routedconfig.jsonthrough an atomic helper but leftmeta.jsonwith a directos.WriteFile. See S1 for the gap. contextProbeWgcount can briefly exceed 1 —engine_reconciler.go:301-303: @mook-as flagged this at discussion_r3102603375; @Nino-K confirmed at discussion_r3102753949 thatWait()only cares about reaching zero. Analysis agrees: the set-currentContext write is idempotent (samecontextNamefrom a stableinstance.Name()), so a second goroutine writing the same value is harmless.- No file-level locking around
~/.docker/config.json—docker_context.go:175: @mook-as raised and self-resolved at discussion_r3102609534 since the Docker CLI itself does not lock; equivalent guarantee is the best we can offer. - Not adopting
github.com/docker/cli'sContextStore—docker_context.go:25: @Nino-K's response at discussion_r3096716690 explains the dep cost. Defensible, but see Design Concerns — the unfixed bugs are all in scope of what the dep would handle. - One-shot probe (no continuous reassertion) — discussion_r3101959340: a polling loop fighting the user's later context changes was explicitly rejected. Implementation matches.
Unresolved Feedback ¶
- discussion_r3101962597 (@mook-as — TODO or filed issue for Windows): promoted to I6 — @mook-as's approval was conditioned on it.
- discussion_r3102364628 (@mook-as — continue-on-read-failure clobber risk): promoted to I3 — the author agreed in discussion_r3102730266 but deferred. The fix is one
return. - discussion_r3096228639 (@mook-as —
run -0instead of$()): promoted to S5 — unaddressed. - discussion_r3096360169 (@mook-as — ContainerEngineReady timing): see Design Observations. Architectural; not a regression.
- discussion_r3096335657 (@mook-as —
yq --inplace): low-priority. The currentjq … > tmp && mv tmppattern is atomic on the same filesystem on Unix, so safe in practice. Will be obviated by I4's HOME/DOCKER_CONFIG isolation.
Agent Performance Retro ¶
[Claude] Took the longest by far and used the budget on round 2-specific work: read the PR review thread carefully and promoted two unresolved-but-agreed reviewer comments (the swallowed read error → I3, the missing Windows TODO → I6) into Important findings the other agents missed entirely. I3 is the most consequential finding in this round — silent data corruption on a transient read failure, with reviewer concurrence on the fix — and only Claude raised it. Caught two cleanup misses that round 1 also flagged but the author left in place (contextsDir unused, rdd service vs rdd svc). Did not re-flag the default-context-skip (round 1 S2) or DOCKER_CONFIG miss (round 1 S3) — those were both unaddressed and a thorough re-read of round 1's open suggestions would have caught them.
[Codex] Fastest of the three, with the tightest finding-to-noise ratio. Independently identified the TLS/SSH transport drop (I2) as the deeper version of round 1's I2 fix — a real upgrade in framing, since the simple "non-unix:// URL" framing from round 1 has been closed but the broader transport-state issue remained. Caught the BATS isolation regression (I4 with Gemini). Did not raise the default-context skip, the DOCKER_CONFIG miss, the swallowed-read-error clobber, or the missing Windows TODO. Spent ~5MB of stderr on its session log; the actual review was 109 lines.
[Gemini] Correctly re-flagged the two round-1 suggestions the author skipped (default-context skip → C1, DOCKER_CONFIG ignored → I2). Severity calibration was hot again: marked the default skip as Critical (downgraded here to Important — recoverable via one CLI command) and produced one false positive (C2: "os.Rename fails on Windows when destination exists") that cited non-existent "system instructions" and contradicts Go's well-documented Windows behavior (MoveFileEx with MOVEFILE_REPLACE_EXISTING). The hallucinated source is a calibration concern; the underlying finding does not survive verification. Did not flag the swallowed-read-error or the missing Windows TODO.
Summary ¶
| Claude Opus 4.7 | Codex GPT 5.4 | Gemini 3.1 Pro | |
|---|---|---|---|
| Duration | 8m 10s | 7m 13s | 4m 51s |
| Findings | 2I 4S | 2I 1S | 2I 1S |
| Tool calls | 34 (Bash 16, Grep 12, Read 6) | 56 (shell 52, stdin 4) | 18 (runshellcommand 11, grepsearch 6, writefile 1) |
| Design observations | 4 | 3 | 2 |
| False positives | 0 | 0 | 1 |
| Unique insights | 4 | 1 | 2 |
| Files reviewed | 4 | 4 | 4 |
| Coverage misses | 0 | 0 | 0 |
| Totals | 2I 4S | 2I 1S | 2I 1S |
| Downgraded | 0 | 0 | 1 (I→S) |
| Dropped | 0 | 0 | 1 |
Three of the most material round-2 findings (I3 swallowed-read clobber, I1 default-skip, I6 Windows TODO) were each found by exactly one agent — no agent caught all three. The ensemble's value here was the union, not the intersection: any single-agent run would have left at least one Important issue on the floor. Claude's PR-review-thread reading was the highest-leverage technique this round.
Reconciliation ¶
- Gemini C1 default-context skip → Important (I1). Recoverable with one CLI command; not data loss.
- Gemini C2
os.RenameWindows failure → rejected as false positive. Go's Windowsos.RenameusesMoveFileExwithMOVEFILE_REPLACE_EXISTINGand replaces existing files; the cited "system instructions" do not exist in this repo.
Review Process Notes ¶
Skill improvements
- On a re-review (Round ≥ 2), explicitly re-check every Suggestion from the prior round against the current diff. Suggestions that were not addressed should either be re-raised at the same severity (with a "round N S<X> unaddressed" note) or explicitly demoted with reasoning. The author has had a chance to triage them; persistent suggestions are a stronger signal than first-time ones, and silently dropping them lets unaddressed issues accumulate across rounds. In this review, two prior-round suggestions (default-context skip, DOCKER_CONFIG) survived only because Gemini happened to re-find them — Claude and Codex both missed them. A re-check pass against the prior report would catch these systematically rather than relying on agent recall.
- Reject findings that cite non-existent "system instructions" or repo conventions. Treat unverified citations as a calibration signal — when an agent grounds a finding in a fabricated source, the underlying claim itself often does not survive verification (the agent reached for evidence because the claim felt thin). Specifically: when a finding's framing relies on a documented behavior or a repo rule, verify the rule exists before accepting the finding. If the citation is hallucinated, downgrade to "verify independently" rather than accept on the agent's confidence alone.