Deep Review: 20260416-144802-pr-320
| Date | 2026-04-16 15:14 |
| Repo | rancher-sandbox/rancher-desktop-daemon |
| Round | 1 |
| Author | @Nino-K |
| PR | #320 — Manage Docker CLI context on ContainerEngineReady transitions |
| Branch | docker-context |
| Commits | 001fd95 Manage Docker CLI context on ContainerEngineReady transitionsef5114d Add bats to support the docker context |
| Reviewers | Claude Opus 4.7, Codex GPT 5.4, Gemini 3.1 Pro |
| Verdict | Rework — three serious correctness bugs in the probe and config-write paths must be fixed before merge. |
| Wall-clock time | 30 min 15 s |
Executive Summary ¶
PR 320 wires Docker CLI context creation/teardown into the engine reconciler: on ContainerEngineReady=True it writes a rancher-desktop-<suffix> context pointing at instance.DockerSocket(), and a background probe promotes that context to currentContext if the user's existing context is missing or unreachable. On the way back to False, the probe is cancelled and the context is removed.
The design (probe in a goroutine derived from watcherCtx, generation counter to deduplicate concurrent probes) is sensible, but three correctness bugs make this PR unsafe to ship as-is, and all three were identified independently by all reviewers:
- Probe-vs-cleanup race resurrects a deleted context. The goroutine does not check
probeCtx.Err()before writingcurrentContext, so a cancelled probe interprets cancellation as "current context is unhealthy" and writes our context name to~/.docker/config.jsonafterremoveDockerContexthas already cleared it and deleted the meta dir. The user'sdockerCLI then fails with "Current context … is not found" until they manually rundocker context use default. - Hardcoded
unix://breaks every non-Unix Docker context. The probe stripsunix://from the user's existing context Host and unconditionally re-prepends it before dialing, sotcp://,ssh://, andnpipe://contexts probe asunix://tcp://…, get marked unhealthy, and silently get overridden — Windows users and remote-Docker users lose their working context. - Non-atomic, repeated writes to
~/.docker/config.jsonrisk data loss.os.WriteFiletruncates before writing, andremoveDockerContextruns on every reconcile in the stopped state, so the user's auths/credsStore can be lost to a partial write or a race with a concurrentdocker login.
Two further important issues complete the must-fix list: the new code logs errors via logf.Log.Info(..., "error", err) instead of the scoped log.Error(...) pattern used everywhere else in the reconciler, and the hardcoded unix:// scheme in createReplaceDockerContext makes the context Docker CLI on Windows cannot open. Reviewer @mook-as raised the same Windows concern in inline review comments that remain unresolved.
The BATS tests mutate the developer's actual ~/.docker/config.json — they should run against an isolated HOME (or DOCKER_CONFIG) so a mid-test failure does not corrupt local Docker state. None of the reviewer's prior-round comments have been addressed.
Structure: the change is small (4 files, ~559 LoC) but split into one "code" commit and one "tests" commit; squash before merge.
Critical Issues ¶
None.
Important Issues ¶
if err := r.deleteAllOfType(ctx, &containersv1alpha1.ContainerList{}); err != nil {
errs = append(errs, fmt.Errorf("failed to delete Containers: %w", err))
}
if err := r.deleteAllOfType(ctx, &containersv1alpha1.VolumeList{}); err != nil {
errs = append(errs, fmt.Errorf("failed to delete Volumes: %w", err))
}
if err := r.deleteAllOfType(ctx, &containersv1alpha1.ImageList{}); err != nil {
errs = append(errs, fmt.Errorf("failed to delete Images: %w", err))
}
if err := r.deleteAllOfType(ctx, &containersv1alpha1.ContainerNamespaceList{}); err != nil {
errs = append(errs, fmt.Errorf("failed to delete ContainerNamespaces: %w", err))
}
return errors.Join(errs...)
}
// deleteAllOfType lists and removes every resource of the given kind
// in r.apiNamespace. Engine is authoritative for Container, Image, and
// Volume in the App's namespace: this loop deletes every object, not
// just engine-owned mirrors. Coexistence with another writer to these
// kinds requires an engine-owned label and matching filters here and
// in the sync_*.go full-sync prune paths. Finalizer removal retries
// on conflict; per-item errors are collected so one stuck object does
// not block the rest.
func (r *EngineReconciler) deleteAllOfType(ctx context.Context, list client.ObjectList) error {
if err := r.List(ctx, list, client.InNamespace(r.apiNamespace)); err != nil {
return err
}
items, err := meta.ExtractList(list)
if err != nil {
return err
}
var errs []error
for _, item := range items {
obj := item.(client.Object)
key := client.ObjectKeyFromObject(obj)
// meta.ExtractList leaves each item's TypeMeta empty (GVK is
// only on the list), so format the Go type with %T instead of
// obj.GetObjectKind().
retryErr := retry.RetryOnConflict(retry.DefaultRetry, func() error {
latest := obj.DeepCopyObject().(client.Object)
if err := r.Get(ctx, key, latest); err != nil {
if apierrors.IsNotFound(err) {
When the engine stops, removeDockerContext (line 328) cancels probeCtx, clears currentContext if it points at us, and deletes the context dir — without waiting for the probe goroutine to finish. The probe interprets the cancellation in probeDockerContext's Ping as "context is unhealthy" and proceeds to write setCurrentDockerContext(contextName). That write can land after removeDockerContext has already deleted the meta dir, leaving currentContext pointing at a non-existent context. Every subsequent docker ps/docker build/docker context inspect then fails with "Current context … is not found" until the user manually runs docker context use default.
// only on the list), so format the Go type with %T instead of
// obj.GetObjectKind().
retryErr := retry.RetryOnConflict(retry.DefaultRetry, func() error {
latest := obj.DeepCopyObject().(client.Object)
if err := r.Get(ctx, key, latest); err != nil {
if apierrors.IsNotFound(err) {
return nil
}
return err
}
if !controllerutil.RemoveFinalizer(latest, mirrorFinalizer) {
The generation counter (myGen) only protects the cancel-slot bookkeeping; the disk write on line 319 is not gated on probeCtx.Err(), generation match, or any wait on the cleanup path.
Fix: gate the promotion on probeCtx.Err(). A fully race-free fix also adds a sync.WaitGroup so removeDockerContext blocks until the probe returns:
go func() {
+ r.contextProbeWg.Add(1)
+ defer r.contextProbeWg.Done()
defer func() {
...
}()
...
if !healthy {
+ if probeCtx.Err() != nil {
+ return
+ }
if err := setCurrentDockerContext(contextName); err != nil {
...
}
}
}()
func (r *EngineReconciler) removeDockerContext() {
r.contextMu.Lock()
...
r.contextMu.Unlock()
+ r.contextProbeWg.Wait()
...
}
The wait is bounded by dockerContextProbeTimeout (3s) because probeCtx gets cancelled before the wait. The Err() check alone closes most of the window; the WaitGroup closes the remainder where the goroutine has read !healthy and is about to call setCurrentDockerContext when cancellation fires.
return nil
}
return err
}
// getDockerContextSocket returns the Unix socket path that the named context
// points at. Returns an empty string if the context does not exist or has no
// docker endpoint.
func getDockerContextSocket(name string) (string, error) {
path, err := contextMetaPath(name)
if err != nil {
return "", err
}
data, err := os.ReadFile(path)
if errors.Is(err, os.ErrNotExist) {
return "", nil
}
if err != nil {
return "", err
}
var meta dockerContextMeta
if err := json.Unmarshal(data, &meta); err != nil {
return "", err
}
host := meta.Endpoints["docker"].Host
return strings.TrimPrefix(host, "unix://"), nil
}
// getCurrentDockerContext reads the currentContext field from ~/.docker/config.json.
// Returns an empty string if the file does not exist or no context is set.
func getCurrentDockerContext() (string, error) {
dir, err := dockerConfigDir()
If the user's current context has Host: "tcp://foo:2375" (or ssh://, npipe://), getDockerContextSocket returns the string unchanged because the "unix://" prefix doesn't match. probeDockerContext then constructs "unix://tcp://foo:2375", the moby client parses it as proto=unix, addr="tcp://foo:2375", attempts to dial a Unix socket at the literal path, and fails. The probe reports healthy=false, and the goroutine silently overrides the user's working remote/Windows-pipe context with ours.
This violates the PR's documented contract ("if the current context is healthy, leave it alone") for every non-Unix endpoint and is the same class of bug Windows users will hit in I4.
Fix: stop stripping the scheme — pass the full Host URL through to the client. Rename for clarity:
-func getDockerContextSocket(name string) (string, error) {
+// getDockerContextHost returns the Docker Host URL (e.g. "unix:///path",
+// "tcp://host:port") that the named context points at.
+func getDockerContextHost(name string) (string, error) {
...
- host := meta.Endpoints["docker"].Host
- return strings.TrimPrefix(host, "unix://"), nil
+ return meta.Endpoints["docker"].Host, nil
}
-func probeDockerContext(ctx context.Context, socketPath string) bool {
+func probeDockerContext(ctx context.Context, host string) bool {
...
- cli, err := dockerclient.New(dockerclient.WithHost("unix://" + socketPath))
+ cli, err := dockerclient.New(dockerclient.WithHost(host))
...
}
}
// updateDockerConfig reads the Docker config file at path, applies mutate,
// and writes the result back. All other keys in the file are preserved.
// If the file does not exist and mutate leaves cfg empty, no file is written.
func updateDockerConfig(path string, mutate func(map[string]any)) error {
cfg := map[string]any{}
data, err := os.ReadFile(path)
notExist := errors.Is(err, os.ErrNotExist)
switch {
case err == nil:
if jsonErr := json.Unmarshal(data, &cfg); jsonErr != nil {
return jsonErr
}
case notExist:
// we should create the file below if needed
default:
return err
}
before := len(cfg)
mutate(cfg)
// file didn't exist and there is nothing left nothing to write.
if notExist && len(cfg) == 0 && before == 0 {
return nil
}
out, err := json.MarshalIndent(cfg, "", "\t")
if err != nil {
return err
}
if err := os.MkdirAll(filepath.Dir(path), 0o700); err != nil {
return err
}
return os.WriteFile(path, append(out, '\n'), 0o600)
}
updateDockerConfig uses os.WriteFile, which opens with O_TRUNC and writes in place. Two failure modes:
- Crash after truncate, before write completes: The user's
~/.docker/config.jsonis left empty or partial, losing every key it held (auths,credsStore,currentContext, plugin settings, etc.). - Race with concurrent docker CLI: A
docker loginordocker context userunning concurrently performs its own read-modify-write. Whichever write lands second silently clobbers the other's changes — the user loses an auth entry or a context switch with no error.
Severity is amplified by the design: removeDockerContext is called on every reconcile when the engine is stopped (see S1), so the file is read, re-marshalled, and rewritten on a steady cadence in the background even when nothing has changed.
Fix: write to a sibling temp file and rename(2) — atomic on POSIX, and fixable on Windows by removing the destination first. Also short-circuit when mutate produces no change so steady state pays no I/O cost:
-func updateDockerConfig(path string, mutate func(map[string]any)) error {
+func updateDockerConfig(path string, mutate func(map[string]any) bool) error {
cfg := map[string]any{}
...
- before := len(cfg)
- mutate(cfg)
-
- if notExist && len(cfg) == 0 && before == 0 {
+ if changed := mutate(cfg); !changed {
return nil
}
out, err := json.MarshalIndent(cfg, "", "\t")
...
- return os.WriteFile(path, append(out, '\n'), 0o600)
+ tmp, err := os.CreateTemp(filepath.Dir(path), ".config.json.*")
+ if err != nil {
+ return err
+ }
+ if _, err := tmp.Write(append(out, '\n')); err != nil {
+ tmp.Close()
+ os.Remove(tmp.Name())
+ return err
+ }
+ if err := tmp.Chmod(0o600); err != nil {
+ tmp.Close()
+ os.Remove(tmp.Name())
+ return err
+ }
+ if err := tmp.Close(); err != nil {
+ os.Remove(tmp.Name())
+ return err
+ }
+ return os.Rename(tmp.Name(), path)
}
The mutate callback returns bool so the caller can signal whether it actually changed anything — setCurrentDockerContext returns true only when the field changes, clearCurrentDockerContext only when the matching key existed. This collapses I3 with S1 (no rewrite-on-every-reconcile).
return err
}
if err := os.MkdirAll(filepath.Dir(path), 0o700); err != nil {
return err
}
meta := dockerContextMeta{
Name: name,
Metadata: map[string]any{},
Endpoints: map[string]dockerEndpointMeta{
"docker": {Host: "unix://" + socketPath},
},
}
data, err := json.MarshalIndent(meta, "", "\t")
if err != nil {
return err
}
instance.DockerSocket() returns a Windows path like %USERPROFILE%\.rd<suffix>\docker.sock; the resulting meta.json contains Host: "unix://C:\\Users\\…\\docker.sock", which Docker CLI on Windows cannot use (it expects npipe://./pipe/...). The BATS file skips Windows at line 15, but manageDockerContext runs unconditionally in production, so the controller will write a broken context for every Windows user. Reviewer @mook-as raised the same concern at discussion_r3096484641, which is unresolved.
"encoding/hex"
"encoding/json"
"errors"
"os"
"path/filepath"
"strings"
"time"
dockerclient "github.com/moby/moby/client"
)
Fix (smallest): gate manageDockerContext and removeDockerContext on runtime.GOOS != "windows" until the named-pipe path is wired up. Long-term: switch to github.com/docker/cli/cli/context/store, which constructs the platform-appropriate Host (see Design Concerns).
func (r *EngineReconciler) manageDockerContext(socketPath string) {
+ if runtime.GOOS == "windows" {
+ return
+ }
contextName := instance.Name()
...
}
Apply the same guard to removeDockerContext.
logf.Log.Info(..., "error", err) instead of the scoped log.Error(...) pattern Claude¶
if !changed {
return nil
}
return r.Status().Update(ctx, latest)
})
}
// cleanupMirrorResources removes every mirror resource the engine
// controller owns. Errors from all four kinds are joined so one stuck
// type does not block the rest.
func (r *EngineReconciler) cleanupMirrorResources(ctx context.Context) error {
Every other error site in this package uses log := logf.FromContext(ctx); log.Error(err, msg, kv...). The new code deviates on two axes:
.Infologs at info severity with an"error"field, so operators filtering on error level miss these and structured-log parsers do not promoteerrinto a dedicated field.logf.Logis the unscoped global logger.SetupWithManagerprimesr.watcherCtxwithlogf.IntoContext(..., mgr.GetLogger().WithName("engine-watcher")), but the new code ignores it — these lines have no controller name and no goroutine identity.
Fix: capture a scoped logger at goroutine launch and use .Error:
func (r *EngineReconciler) manageDockerContext(socketPath string) {
contextName := instance.Name()
+ log := logf.FromContext(r.watcherCtx).WithName("docker-context")
if err := createReplaceDockerContext(contextName, socketPath); err != nil {
- logf.Log.Info("Failed to create Docker context", "context", contextName, "error", err)
+ log.Error(err, "Failed to create Docker context", "context", contextName)
return
}
...
}
Apply the same pattern to removeDockerContext.
Suggestions ¶
// expect Container/Image/Volume mirrors must gate on
// Reason, not Status alone. The condition will be renamed
// when containerd mirroring lands.
terminalReason = "NotApplicable"
terminalStatus = metav1.ConditionTrue
terminalMessage = "Engine mirroring is only supported with the moby backend"
}
current := meta.FindStatusCondition(app.Status.Conditions, appv1alpha1.AppConditionContainerEngineReady)
alreadyClean := !watcherDied && current != nil && current.Reason == terminalReason && current.Status == terminalStatus
if !alreadyClean {
if err := r.cleanupMirrorResources(ctx); err != nil {
log.Error(err, "Failed to clean up mirror resources")
return ctrl.Result{}, err
}
}
return ctrl.Result{}, r.setEngineCondition(ctx, &app, terminalStatus, terminalReason, terminalMessage)
}
if !watcherRunning {
log.Info("App is running, starting Docker watcher")
if err := r.startWatcherAndSync(ctx); err != nil {
The else branch fires every reconcile while the engine is stopped, so removeDockerContext → clearCurrentDockerContext → updateDockerConfig reads, re-marshals, and rewrites ~/.docker/config.json continuously even in steady state. With the current non-atomic write (I3), this turns a single stale-config-cleanup into ongoing risk. Stable state should be no-I/O.
Fix: move removeDockerContext() inside the !alreadyClean block, or fold it into the same condition that gates cleanupMirrorResources. With I3's "no rewrite when nothing changed" path, the cost goes away automatically.
// just engine-owned mirrors. Coexistence with another writer to these
// kinds requires an engine-owned label and matching filters here and
// in the sync_*.go full-sync prune paths. Finalizer removal retries
// on conflict; per-item errors are collected so one stuck object does
// not block the rest.
func (r *EngineReconciler) deleteAllOfType(ctx context.Context, list client.ObjectList) error {
if err := r.List(ctx, list, client.InNamespace(r.apiNamespace)); err != nil {
return err
}
items, err := meta.ExtractList(list)
if err != nil {
return err
}
var errs []error
for _, item := range items {
obj := item.(client.Object)
key := client.ObjectKeyFromObject(obj)
// meta.ExtractList leaves each item's TypeMeta empty (GVK is
The comment claims "default" and "" both mean "no working context", but default is a real Docker context — on Linux with Docker Engine installed it points to /var/run/docker.sock, which may be running. The code treats it as broken and silently promotes ours. The PR's stated contract ("if the current context is healthy, leave it alone") would be violated on every Linux box with both Docker Engine and Rancher Desktop installed.
Fix: probe default explicitly via dockerclient.DefaultDockerHost (or client.FromEnv) instead of skipping it. The change is small once I2's "pass the full Host through" refactor is in place.
type dockerEndpointMeta struct {
Host string `json:"Host"`
SkipTLSVerify bool `json:"SkipTLSVerify"`
}
func dockerConfigDir() (string, error) {
home, err := os.UserHomeDir()
if err != nil {
return "", err
}
return filepath.Join(home, ".docker"), nil
}
// contextMetaPath returns the path to the meta.json file for the named context.
// Docker creates the directory name using sha256(contextName).
func contextMetaPath(name string) (string, error) {
dir, err := dockerConfigDir()
Docker CLI honours $DOCKER_CONFIG to relocate the config dir. Users who set it (common in dev containers, CI runners, or Nix profiles) will have the context written under ~/.docker while their CLI reads from $DOCKER_CONFIG, so the new context never appears.
func dockerConfigDir() (string, error) {
+ if dir := os.Getenv("DOCKER_CONFIG"); dir != "" {
+ return dir, nil
+ }
home, err := os.UserHomeDir()
if err != nil {
return "", err
}
return filepath.Join(home, ".docker"), nil
}
}
// getDockerContextSocket returns the Unix socket path that the named context
// points at. Returns an empty string if the context does not exist or has no
// docker endpoint.
func getDockerContextSocket(name string) (string, error) {
path, err := contextMetaPath(name)
if err != nil {
return "", err
}
data, err := os.ReadFile(path)
if errors.Is(err, os.ErrNotExist) {
return "", nil
}
if err != nil {
return "", err
}
var meta dockerContextMeta
if err := json.Unmarshal(data, &meta); err != nil {
return "", err
}
host := meta.Endpoints["docker"].Host
return strings.TrimPrefix(host, "unix://"), nil
}
// getCurrentDockerContext reads the currentContext field from ~/.docker/config.json.
// Returns an empty string if the file does not exist or no context is set.
func getCurrentDockerContext() (string, error) {
dir, err := dockerConfigDir()
If meta.json exists but has no docker endpoint (a kubernetes-only context, which Docker CLI does write), the map lookup returns a zero dockerEndpointMeta{} with Host="". The caller then takes the currentSocket != "" false branch, treats healthy=false, and overrides a context the user is actively using — just one without a docker endpoint.
Fix shares the I2 refactor: return the raw Host and let the caller distinguish "" → not probeable, leave alone from currentSocket missing → promote ours.
~/.docker/config.json without HOME isolation Claude Codex Gemini¶
docker rm --force test-restart
}
# --- Cleanup on shutdown ---
@test "stopping VM removes all mirror resources" {
# Make sure we have at least one resource to verify cleanup.
rdd ctl wait --for=create --namespace="${RDD_NAMESPACE}" \
ContainerNamespace/moby --timeout=10s
rdd set running=false
run -0 rdd ctl get containers --namespace="${RDD_NAMESPACE}" --output=name
refute_output
run -0 rdd ctl get images --namespace="${RDD_NAMESPACE}" --output=name
refute_output
run -0 rdd ctl get volumes --namespace="${RDD_NAMESPACE}" --output=name
refute_output
run -0 rdd ctl get ContainerNamespaces --namespace="${RDD_NAMESPACE}" --output=name
refute_output
}
@test "rdd set running=false returns promptly when already stopped" {
# rdd set running=false waits for the App's Running condition to
# go to False, not for ContainerEngineReady. When the VM is
# already stopped, the wait must complete immediately rather than
# hang on the already-False engine condition.
rdd set --timeout=10s running=false
}
@test "VM start recreates ContainerNamespace/moby after cleanup" {
# The "stopping VM removes all mirror resources" test above swept
# ContainerNamespace/moby along with the rest of the mirrors, so
# restarting the VM must recreate it. This bridges the teardown
# above to the ContainerNamespace delete test below, which needs
# the namespace to exist before it can delete it.
rdd set running=true
rdd ctl wait --for=create --namespace="${RDD_NAMESPACE}" \
ContainerNamespace/moby --timeout=10s
}
@test "deleting ContainerNamespace/moby completes without a finalizer hang" {
# moby ContainerNamespace has no mirror finalizer, so a user delete
# must return promptly rather than get trapped in Terminating.
rdd ctl delete ContainerNamespace/moby --namespace="${RDD_NAMESPACE}" --timeout=10s
run -0 rdd ctl get ContainerNamespaces --namespace="${RDD_NAMESPACE}" --output=name
refute_output
}
# --- containerd backend ---
@test "containerd backend reports ContainerEngineReady=NotApplicable and skips mirroring" {
# Stop first so there is no stale True/Connected from moby to
# satisfy the CLI wait below before the engine reconciler has run.
rdd set running=false
# Start with containerd. rdd set waits for ContainerEngineReady=True,
# which the engine reconciler satisfies immediately with reason
# NotApplicable because engine mirroring only supports the moby
# backend.
rdd set containerEngine.name=containerd running=true
run -0 rdd ctl get app app \
-o jsonpath='{.status.conditions[?(@.type=="ContainerEngineReady")].reason}'
assert_output "NotApplicable"
# No mirror resources should exist in containerd mode.
run -0 rdd ctl get containers --namespace="${RDD_NAMESPACE}" --output=name
refute_output
run -0 rdd ctl get images --namespace="${RDD_NAMESPACE}" --output=name
refute_output
run -0 rdd ctl get volumes --namespace="${RDD_NAMESPACE}" --output=name
refute_output
run -0 rdd ctl get ContainerNamespaces --namespace="${RDD_NAMESPACE}" --output=name
refute_output
}
The tests operate on the developer's actual $HOME/.docker/config.json (no helper overrides HOME). The "restore in teardown" inline block at lines 471–475 runs only when execution reaches it — any assert_output failure above it leaves currentContext cleared. There is no file-level teardown.
Fix options, in order of preference:
- Override
HOME(orDOCKER_CONFIG, after S3 lands) inlocal_setup_fileto a per-test temp dir; the engine controller picks it up when started by the test harness. - Move the save/restore into a BATS
teardownso it runs unconditionally.
refute_output
run -0 rdd ctl get ContainerNamespaces --namespace="${RDD_NAMESPACE}" --output=name
refute_output
}
@test "rdd set running=false returns promptly when already stopped" {
# rdd set running=false waits for the App's Running condition to
# go to False, not for ContainerEngineReady. When the VM is
# already stopped, the wait must complete immediately rather than
# hang on the already-False engine condition.
rdd set --timeout=10s running=false
Repo convention forbids $(cmd) substitution when the result feeds an assertion — failures inside $(...) produce opaque BATS errors with no output. Use run -0 <cmd>; var=$output. Reviewer @mook-as flagged the same line at discussion_r3096228639.
data, err := os.ReadFile(p.configFile)
assert.NilError(t, err)
var cfg map[string]any
assert.NilError(t, json.Unmarshal(data, &cfg))
_, hasCtx := cfg["currentContext"]
assert.Assert(t, !hasCtx, "currentContextshouldhavebeenremoved")
_, hasAuths := cfg["auths"]
assert.Assert(t, hasAuths, "auths key must be preserved")
})
}
-assert.Assert(t, !hasCtx, "currentContextshouldhavebeenremoved")
+assert.Assert(t, !hasCtx, "currentContext should have been removed")
The probe's central responsibility — "if the user's context is healthy, leave it alone" — has no integration test. Existing tests cover only "unset → promote ours" and "stop → remove". A test that stages a healthy alternative context (e.g. a second context name pointing at the same socket, set as current) and asserts currentContext is unchanged after the engine starts would catch I2 regressions and document the policy.
"gotest.tools/v3/assert"
)
// dockerTestPaths holds the config and contexts directory for a test.
type dockerTestPaths struct {
configFile string
contextsDir string
}
// newDockerTestDir creates a temp ~/.docker layout and returns its paths.
func newDockerTestDir(t *testing.T) dockerTestPaths {
t.Helper()
root := t.TempDir()
dockerDir := filepath.Join(root, ".docker")
assert.NilError(t, os.MkdirAll(dockerDir, 0o700))
return dockerTestPaths{
configFile: filepath.Join(dockerDir, "config.json"),
contextsDir: filepath.Join(dockerDir, "contexts"),
}
}
func Test_updateDockerConfig_set(t *testing.T) {
t.Parallel()
t.Run("creates file when absent", func(t *testing.T) {
contextsDir is assigned but never referenced by any test in the file. Drop the field, or convert it to a helper returning the meta-path that removes the t.Setenv("HOME", …) gymnastics in the create/delete tests.
Design Observations ¶
Concerns
- Hand-rolled Docker context store duplicates docker/cli and is the source of every correctness bug here
(in-scope)Claude Codex Gemini — The PR re-implements context directory hashing, JSON metadata serialization, default-context resolution, and atomic config-file mutation. Each of I2, I3, I4, S2, S3 flows directly from one of those re-implementations.github.com/docker/cli/cli/context/store(suggested by reviewer @mook-as at discussion_r3096468463) provides a tested implementation that constructs the platform-appropriate Host, atomically writes the metadata, and resolvesdefaultviaclient.FromEnv. The trade is one new dependency (already transitively present via moby/moby). Worth re-visiting before adding more code on top of the hand-written path. - Design docs still attribute Docker context creation to the App controller
(in-scope)Claude —docs/design/api_app.md:39anddocs/design/bootstrap.md:58say the App controller creates the Docker context during start. This PR places the responsibility in the engine reconciler instead, which is the right home (engine ownsContainerEngineReady), but the design docs now lie. Update the docs to match, or move the code to the App controller.
Strengths
- Generation counter for probe deduplication Claude Codex —
contextProbeGen + myGenis the right shape for deduplicating concurrent probes; a superseded goroutine cannot accidentally clear a newer probe's cancel slot. The pattern just needs to be extended to gate the disk write as well (I1). - *Probe lifetime tied to
watcherCtx, not Reconcile'sctx* Claude Codex — Correct: the probe outlives a single Reconcile call (a per-reconcile deadline must not kill it) but dies on manager shutdown. - *
!watcherRunningelse-branch handles the dead-watcher case* Claude — A watcher that died between reconciles would otherwise leave the Docker context behind forever; the explicitremoveDockerContextcall closes that hole. (S1 is about how often it then runs, not whether it should run at all.)
Testing Assessment ¶
The new unit tests cover the file I/O helpers in isolation (updateDockerConfig, createReplaceDockerContext, getCurrentDockerContext) and the BATS additions exercise the create-on-start and remove-on-stop end-to-end paths. The gaps are concentrated in the probe goroutine and the config-file integration:
- Probe-vs-cleanup race (I1) — no test triggers a stop while a probe is in flight; a test that stages a slow probe (e.g. via a context pointing at a socket that hangs) and then sets
running=falsewould reproduce the resurrection bug. - Non-Unix current context (I2) — no test stages
Host: "tcp://…"ornpipe://…incurrentContext. The probe behavior with those schemes is unverified. - Healthy-context-skip (S8) — the central policy is untested.
defaultcontext interaction (S2) — no test of a healthy/var/run/docker.sock(the realistic Linux scenario) being preserved.- Crash-during-write (I3) — no test for the partial-write recovery, though once
os.Rename-based atomicity is in place this becomes structurally impossible. - Repeated engine restarts —
manageDockerContextruns once per start in tests, never twice; the cancel-on-supersede path is unexercised. - Windows (I4) — entirely skipped via
skip_on_windows; production code runs unconditionally there.
Documentation Assessment ¶
docs/design/bootstrap.md:58anddocs/design/api_app.md:39still attribute Docker context creation to the App controller /rdd startflow. This PR moves the work to the engine reconciler. Either the docs must be updated, or the code belongs elsewhere (see Design Concerns).getDockerContextSocket's docstring atdocker_context.go:90-92describes the return as a "Unix socket path"; once I2 is fixed and the function returns a generic Host URL, the docstring needs to follow.manageDockerContext's docstring atengine_reconciler.go:268-270says "At most one probe runs at a time." Mechanically true via the generation cancel, but it does not warn about the I1 race withremoveDockerContext. Update once I1 is fixed (or note the limitation if the fix is deferred).
Commit Structure ¶
Two commits: 001fd95 (code) + ef5114d (tests). The tests verify the code in the preceding commit; squashing into one commit is conventional for this repo and keeps git bisect honest. Not blocking.
Unresolved Feedback ¶
Eight inline review comments from @mook-as remain unresolved on the current diff:
- discussion_r3096224461 (
bats:436) — useassert_file_existsinstead ofrun -0 test -f. Style. - discussion_r3096228639 (
bats:442) — userun -0 rdd …; var=$outputinstead of$(rdd …). Covered by S6. - discussion_r3096335657 (
bats:457) — consideryq --inplace. Optional; obviated by S5 if HOME isolation lands. - discussion_r3096360169 (
bats:465) — "ContainerEngineReady shouldn't be set until the context is set up correctly." Architectural: shouldContainerEngineReady=Connectedwait formanageDockerContextto complete, including the probe? Today it does not. Worth answering in the PR. - discussion_r3096439251 (
bats:489) — precomputecontext_dir, dropbash -c, useassert_dir_not_exists. Covered by S5/S6. - discussion_r3096468463 (
docker_context.go:26) — adoptgithub.com/docker/cli/cli/context/store. See Design Concerns; would resolve I2/I3/I4/S2/S3. - discussion_r3096484641 (
docker_context.go:90) — Windows: named pipes vsunix://. Covered by I4. - discussion_r3096522025 (
engine_reconciler.go:271) — reviewer self-correction; effectively resolved.
Agent Performance Retro ¶
[Claude] Produced the most thorough, finely-segmented review: separated the unix:// scheme bug from the Windows-specific Host construction (I2 vs I4), the "default" override (S2), the DOCKER_CONFIG env-var miss (S3), the empty-endpoint case (S4), the per-reconcile rewrite (S1), and the unscoped logf.Log logging (I5) — none of which Codex or Gemini surfaced individually. The cancel-race fix it proposed was the most defensive (gen-check + ctx-check + WaitGroup); the others stopped at the ctx.Err() check alone, which closes most but not all of the window. Verdict line was the only one to call out Windows production behavior as a blocker.
[Codex] Concise and correct. Caught the cancel-race and the unix:// bug independently, and explicitly bundled the "default" override into the same finding rather than separating it — a defensible call. Did not surface the non-atomic write, the Windows compile-but-broken issue, or the logging-style deviation. The fix proposals were the smallest possible diffs; useful when the priority is "minimum surface to land the fix."
[Gemini] Caught the same two consensus bugs as the others and uniquely flagged the non-atomic os.WriteFile data-loss path (I3) — the only finding no other agent raised. Severity calibration was hot: marked all three findings as Critical, where the consolidator downgraded to Important (recoverable in all three cases without data permanently lost on the median user system). The atomic-write framing was the most actionable, with a concrete tmp+rename patch.
| Claude Opus 4.7 | Codex GPT 5.4 | Gemini 3.1 Pro | |
|---|---|---|---|
| Duration | 18m 27s | 3m 34s | 3m 02s |
| Findings | 5I 9S | 2I 1S | 3I |
| Tool calls | 47 (Grep 19, Bash 14, Read 9) | 21 (shell 20, stdin 1) | 7 (grepsearch 3, runshellcommand 3, readfile 1) |
| Design observations | 5 | 3 | 1 |
| False positives | 0 | 0 | 0 |
| Unique insights | 5 | 0 | 1 |
| Files reviewed | 4 | 4 | 4 |
| Coverage misses | 0 | 0 | 0 |
| Totals | 5I 9S | 2I 1S | 3I |
| Downgraded | 0 | 0 | 3 (I→S) |
| Dropped | 0 | 0 | 0 |
Reconciliation ¶
- Gemini C1 cancel-race → Important (I1). Recoverable by
docker context use default; no persistent data loss. Same severity as Claude I1 / Codex I2. - Gemini C2 hardcoded
unix://→ Important (I2). Same severity as Claude I2 / Codex I1; "silent override of working remote context" is a major usability bug but not data loss. - Gemini C3 non-atomic write → Important (I3). Crash-window for partial write is real but narrow on the median user system; concurrent docker-CLI writer is plausible but not common. Atomic fix is trivial.
Review Process Notes ¶
Skill improvements
- When PR code re-implements a well-known external library's on-disk format (e.g. Docker context store, kubeconfig, OCI image layout), call it out as a structural risk in Design Observations even if no individual bug is yet visible. Hand-rolled re-implementations of stable third-party file formats tend to inherit a long tail of edge cases (platform-specific schemes, atomic-write semantics, default-value resolution, environment-variable overrides). Several findings against a small new package are a strong signal that the underlying design is reaching for an external contract; suggest the existing library as the alternative even when the immediate bugs are individually small.
Repo context updates
- *
pkg/controllers/app/engine/controllers/uses scopedlog := logf.FromContext(ctx)withlog.Error(err, msg, kv...)for error sites andlog.Info(msg, kv...)for normal events. Code that runs from goroutines launched byReconcile/SetupWithManagershould capture a derived logger (logf.FromContext(parentCtx).WithName("…")) at goroutine launch rather than fall back to the globallogf.Log, which is unscoped and emits at info-level only.* Flag any newlogf.Log.Info(..., "error", err)pattern in this directory.