Deep Review: 20260416-144802-pr-320

Date2026-04-16 15:14
Reporancher-sandbox/rancher-desktop-daemon
Round1
Author@Nino-K
PR#320 — Manage Docker CLI context on ContainerEngineReady transitions
Branchdocker-context
Commits001fd95 Manage Docker CLI context on ContainerEngineReady transitions
ef5114d Add bats to support the docker context
ReviewersClaude Opus 4.7, Codex GPT 5.4, Gemini 3.1 Pro
VerdictRework — three serious correctness bugs in the probe and config-write paths must be fixed before merge.
Wall-clock time30 min 15 s


Executive Summary

PR 320 wires Docker CLI context creation/teardown into the engine reconciler: on ContainerEngineReady=True it writes a rancher-desktop-<suffix> context pointing at instance.DockerSocket(), and a background probe promotes that context to currentContext if the user's existing context is missing or unreachable. On the way back to False, the probe is cancelled and the context is removed.

The design (probe in a goroutine derived from watcherCtx, generation counter to deduplicate concurrent probes) is sensible, but three correctness bugs make this PR unsafe to ship as-is, and all three were identified independently by all reviewers:

  1. Probe-vs-cleanup race resurrects a deleted context. The goroutine does not check probeCtx.Err() before writing currentContext, so a cancelled probe interprets cancellation as "current context is unhealthy" and writes our context name to ~/.docker/config.json after removeDockerContext has already cleared it and deleted the meta dir. The user's docker CLI then fails with "Current context … is not found" until they manually run docker context use default.
  2. Hardcoded unix:// breaks every non-Unix Docker context. The probe strips unix:// from the user's existing context Host and unconditionally re-prepends it before dialing, so tcp://, ssh://, and npipe:// contexts probe as unix://tcp://…, get marked unhealthy, and silently get overridden — Windows users and remote-Docker users lose their working context.
  3. Non-atomic, repeated writes to ~/.docker/config.json risk data loss. os.WriteFile truncates before writing, and removeDockerContext runs on every reconcile in the stopped state, so the user's auths/credsStore can be lost to a partial write or a race with a concurrent docker login.

Two further important issues complete the must-fix list: the new code logs errors via logf.Log.Info(..., "error", err) instead of the scoped log.Error(...) pattern used everywhere else in the reconciler, and the hardcoded unix:// scheme in createReplaceDockerContext makes the context Docker CLI on Windows cannot open. Reviewer @mook-as raised the same Windows concern in inline review comments that remain unresolved.

The BATS tests mutate the developer's actual ~/.docker/config.json — they should run against an isolated HOME (or DOCKER_CONFIG) so a mid-test failure does not corrupt local Docker state. None of the reviewer's prior-round comments have been addressed.

Structure: the change is small (4 files, ~559 LoC) but split into one "code" commit and one "tests" commit; squash before merge.


Critical Issues

None.


Important Issues

I1. Probe goroutine ignores cancellation and resurrects the deleted context Claude Codex Gemini
	if err := r.deleteAllOfType(ctx, &containersv1alpha1.ContainerList{}); err != nil {
		errs = append(errs, fmt.Errorf("failed to delete Containers: %w", err))
	}
	if err := r.deleteAllOfType(ctx, &containersv1alpha1.VolumeList{}); err != nil {
		errs = append(errs, fmt.Errorf("failed to delete Volumes: %w", err))
	}
	if err := r.deleteAllOfType(ctx, &containersv1alpha1.ImageList{}); err != nil {
		errs = append(errs, fmt.Errorf("failed to delete Images: %w", err))
	}
	if err := r.deleteAllOfType(ctx, &containersv1alpha1.ContainerNamespaceList{}); err != nil {
		errs = append(errs, fmt.Errorf("failed to delete ContainerNamespaces: %w", err))
	}
	return errors.Join(errs...)
}

// deleteAllOfType lists and removes every resource of the given kind
// in r.apiNamespace. Engine is authoritative for Container, Image, and
// Volume in the App's namespace: this loop deletes every object, not
// just engine-owned mirrors. Coexistence with another writer to these
// kinds requires an engine-owned label and matching filters here and
// in the sync_*.go full-sync prune paths. Finalizer removal retries
// on conflict; per-item errors are collected so one stuck object does
// not block the rest.
func (r *EngineReconciler) deleteAllOfType(ctx context.Context, list client.ObjectList) error {
	if err := r.List(ctx, list, client.InNamespace(r.apiNamespace)); err != nil {
		return err
	}

	items, err := meta.ExtractList(list)
	if err != nil {
		return err
	}

	var errs []error
	for _, item := range items {
		obj := item.(client.Object)
		key := client.ObjectKeyFromObject(obj)
		// meta.ExtractList leaves each item's TypeMeta empty (GVK is
		// only on the list), so format the Go type with %T instead of
		// obj.GetObjectKind().
		retryErr := retry.RetryOnConflict(retry.DefaultRetry, func() error {
			latest := obj.DeepCopyObject().(client.Object)
			if err := r.Get(ctx, key, latest); err != nil {
				if apierrors.IsNotFound(err) {

When the engine stops, removeDockerContext (line 328) cancels probeCtx, clears currentContext if it points at us, and deletes the context dir — without waiting for the probe goroutine to finish. The probe interprets the cancellation in probeDockerContext's Ping as "context is unhealthy" and proceeds to write setCurrentDockerContext(contextName). That write can land after removeDockerContext has already deleted the meta dir, leaving currentContext pointing at a non-existent context. Every subsequent docker ps/docker build/docker context inspect then fails with "Current context … is not found" until the user manually runs docker context use default.

		// only on the list), so format the Go type with %T instead of
		// obj.GetObjectKind().
		retryErr := retry.RetryOnConflict(retry.DefaultRetry, func() error {
			latest := obj.DeepCopyObject().(client.Object)
			if err := r.Get(ctx, key, latest); err != nil {
				if apierrors.IsNotFound(err) {
					return nil
				}
				return err
			}
			if !controllerutil.RemoveFinalizer(latest, mirrorFinalizer) {

The generation counter (myGen) only protects the cancel-slot bookkeeping; the disk write on line 319 is not gated on probeCtx.Err(), generation match, or any wait on the cleanup path.

Fix: gate the promotion on probeCtx.Err(). A fully race-free fix also adds a sync.WaitGroup so removeDockerContext blocks until the probe returns:

 go func() {
+    r.contextProbeWg.Add(1)
+    defer r.contextProbeWg.Done()
     defer func() {
         ...
     }()
     ...
     if !healthy {
+        if probeCtx.Err() != nil {
+            return
+        }
         if err := setCurrentDockerContext(contextName); err != nil {
             ...
         }
     }
 }()

 func (r *EngineReconciler) removeDockerContext() {
     r.contextMu.Lock()
     ...
     r.contextMu.Unlock()
+    r.contextProbeWg.Wait()
     ...
 }

The wait is bounded by dockerContextProbeTimeout (3s) because probeCtx gets cancelled before the wait. The Err() check alone closes most of the window; the WaitGroup closes the remainder where the goroutine has read !healthy and is about to call setCurrentDockerContext when cancellation fires.

I2. Hardcoded unix:// scheme breaks tcp / ssh / npipe Docker contexts Claude Codex Gemini
		return nil
	}
	return err
}

// getDockerContextSocket returns the Unix socket path that the named context
// points at. Returns an empty string if the context does not exist or has no
// docker endpoint.
func getDockerContextSocket(name string) (string, error) {
	path, err := contextMetaPath(name)
	if err != nil {
		return "", err
	}
	data, err := os.ReadFile(path)
	if errors.Is(err, os.ErrNotExist) {
		return "", nil
	}
	if err != nil {
		return "", err
	}
	var meta dockerContextMeta
	if err := json.Unmarshal(data, &meta); err != nil {
		return "", err
	}
	host := meta.Endpoints["docker"].Host
	return strings.TrimPrefix(host, "unix://"), nil
}

// getCurrentDockerContext reads the currentContext field from ~/.docker/config.json.
// Returns an empty string if the file does not exist or no context is set.
func getCurrentDockerContext() (string, error) {
	dir, err := dockerConfigDir()

If the user's current context has Host: "tcp://foo:2375" (or ssh://, npipe://), getDockerContextSocket returns the string unchanged because the "unix://" prefix doesn't match. probeDockerContext then constructs "unix://tcp://foo:2375", the moby client parses it as proto=unix, addr="tcp://foo:2375", attempts to dial a Unix socket at the literal path, and fails. The probe reports healthy=false, and the goroutine silently overrides the user's working remote/Windows-pipe context with ours.

This violates the PR's documented contract ("if the current context is healthy, leave it alone") for every non-Unix endpoint and is the same class of bug Windows users will hit in I4.

Fix: stop stripping the scheme — pass the full Host URL through to the client. Rename for clarity:

-func getDockerContextSocket(name string) (string, error) {
+// getDockerContextHost returns the Docker Host URL (e.g. "unix:///path",
+// "tcp://host:port") that the named context points at.
+func getDockerContextHost(name string) (string, error) {
     ...
-    host := meta.Endpoints["docker"].Host
-    return strings.TrimPrefix(host, "unix://"), nil
+    return meta.Endpoints["docker"].Host, nil
 }

-func probeDockerContext(ctx context.Context, socketPath string) bool {
+func probeDockerContext(ctx context.Context, host string) bool {
     ...
-    cli, err := dockerclient.New(dockerclient.WithHost("unix://" + socketPath))
+    cli, err := dockerclient.New(dockerclient.WithHost(host))
     ...
 }
I3. Non-atomic write of ~/.docker/config.json can lose user data Gemini
}

// updateDockerConfig reads the Docker config file at path, applies mutate,
// and writes the result back. All other keys in the file are preserved.
// If the file does not exist and mutate leaves cfg empty, no file is written.
func updateDockerConfig(path string, mutate func(map[string]any)) error {
	cfg := map[string]any{}
	data, err := os.ReadFile(path)
	notExist := errors.Is(err, os.ErrNotExist)
	switch {
	case err == nil:
		if jsonErr := json.Unmarshal(data, &cfg); jsonErr != nil {
			return jsonErr
		}
	case notExist:
	// we should create the file below if needed
	default:
		return err
	}

	before := len(cfg)
	mutate(cfg)

	// file didn't exist and there is nothing left nothing to write.
	if notExist && len(cfg) == 0 && before == 0 {
		return nil
	}

	out, err := json.MarshalIndent(cfg, "", "\t")
	if err != nil {
		return err
	}
	if err := os.MkdirAll(filepath.Dir(path), 0o700); err != nil {
		return err
	}
	return os.WriteFile(path, append(out, '\n'), 0o600)
}

updateDockerConfig uses os.WriteFile, which opens with O_TRUNC and writes in place. Two failure modes:

  1. Crash after truncate, before write completes: The user's ~/.docker/config.json is left empty or partial, losing every key it held (auths, credsStore, currentContext, plugin settings, etc.).
  2. Race with concurrent docker CLI: A docker login or docker context use running concurrently performs its own read-modify-write. Whichever write lands second silently clobbers the other's changes — the user loses an auth entry or a context switch with no error.

Severity is amplified by the design: removeDockerContext is called on every reconcile when the engine is stopped (see S1), so the file is read, re-marshalled, and rewritten on a steady cadence in the background even when nothing has changed.

Fix: write to a sibling temp file and rename(2) — atomic on POSIX, and fixable on Windows by removing the destination first. Also short-circuit when mutate produces no change so steady state pays no I/O cost:

-func updateDockerConfig(path string, mutate func(map[string]any)) error {
+func updateDockerConfig(path string, mutate func(map[string]any) bool) error {
     cfg := map[string]any{}
     ...
-    before := len(cfg)
-    mutate(cfg)
-
-    if notExist && len(cfg) == 0 && before == 0 {
+    if changed := mutate(cfg); !changed {
         return nil
     }

     out, err := json.MarshalIndent(cfg, "", "\t")
     ...
-    return os.WriteFile(path, append(out, '\n'), 0o600)
+    tmp, err := os.CreateTemp(filepath.Dir(path), ".config.json.*")
+    if err != nil {
+        return err
+    }
+    if _, err := tmp.Write(append(out, '\n')); err != nil {
+        tmp.Close()
+        os.Remove(tmp.Name())
+        return err
+    }
+    if err := tmp.Chmod(0o600); err != nil {
+        tmp.Close()
+        os.Remove(tmp.Name())
+        return err
+    }
+    if err := tmp.Close(); err != nil {
+        os.Remove(tmp.Name())
+        return err
+    }
+    return os.Rename(tmp.Name(), path)
 }

The mutate callback returns bool so the caller can signal whether it actually changed anything — setCurrentDockerContext returns true only when the field changes, clearCurrentDockerContext only when the matching key existed. This collapses I3 with S1 (no rewrite-on-every-reconcile).

I4. Windows: hardcoded unix:// scheme produces a context Docker CLI on Windows cannot open Claude
		return err
	}
	if err := os.MkdirAll(filepath.Dir(path), 0o700); err != nil {
		return err
	}
	meta := dockerContextMeta{
		Name:     name,
		Metadata: map[string]any{},
		Endpoints: map[string]dockerEndpointMeta{
			"docker": {Host: "unix://" + socketPath},
		},
	}
	data, err := json.MarshalIndent(meta, "", "\t")
	if err != nil {
		return err
	}

instance.DockerSocket() returns a Windows path like %USERPROFILE%\.rd<suffix>\docker.sock; the resulting meta.json contains Host: "unix://C:\\Users\\…\\docker.sock", which Docker CLI on Windows cannot use (it expects npipe://./pipe/...). The BATS file skips Windows at line 15, but manageDockerContext runs unconditionally in production, so the controller will write a broken context for every Windows user. Reviewer @mook-as raised the same concern at discussion_r3096484641, which is unresolved.

	"encoding/hex"
	"encoding/json"
	"errors"
	"os"
	"path/filepath"
	"strings"
	"time"

	dockerclient "github.com/moby/moby/client"
)

Fix (smallest): gate manageDockerContext and removeDockerContext on runtime.GOOS != "windows" until the named-pipe path is wired up. Long-term: switch to github.com/docker/cli/cli/context/store, which constructs the platform-appropriate Host (see Design Concerns).

 func (r *EngineReconciler) manageDockerContext(socketPath string) {
+    if runtime.GOOS == "windows" {
+        return
+    }
     contextName := instance.Name()
     ...
 }

Apply the same guard to removeDockerContext.

I5. New error sites use logf.Log.Info(..., "error", err) instead of the scoped log.Error(...) pattern Claude
		if !changed {
			return nil
		}
		return r.Status().Update(ctx, latest)
	})
}

// cleanupMirrorResources removes every mirror resource the engine
// controller owns. Errors from all four kinds are joined so one stuck
// type does not block the rest.
func (r *EngineReconciler) cleanupMirrorResources(ctx context.Context) error {

Every other error site in this package uses log := logf.FromContext(ctx); log.Error(err, msg, kv...). The new code deviates on two axes:

  1. .Info logs at info severity with an "error" field, so operators filtering on error level miss these and structured-log parsers do not promote err into a dedicated field.
  2. logf.Log is the unscoped global logger. SetupWithManager primes r.watcherCtx with logf.IntoContext(..., mgr.GetLogger().WithName("engine-watcher")), but the new code ignores it — these lines have no controller name and no goroutine identity.

Fix: capture a scoped logger at goroutine launch and use .Error:

 func (r *EngineReconciler) manageDockerContext(socketPath string) {
     contextName := instance.Name()
+    log := logf.FromContext(r.watcherCtx).WithName("docker-context")

     if err := createReplaceDockerContext(contextName, socketPath); err != nil {
-        logf.Log.Info("Failed to create Docker context", "context", contextName, "error", err)
+        log.Error(err, "Failed to create Docker context", "context", contextName)
         return
     }
     ...
 }

Apply the same pattern to removeDockerContext.


Suggestions

S1. removeDockerContext re-runs config.json I/O on every stopped-state reconcile Claude
			// expect Container/Image/Volume mirrors must gate on
			// Reason, not Status alone. The condition will be renamed
			// when containerd mirroring lands.
			terminalReason = "NotApplicable"
			terminalStatus = metav1.ConditionTrue
			terminalMessage = "Engine mirroring is only supported with the moby backend"
		}
		current := meta.FindStatusCondition(app.Status.Conditions, appv1alpha1.AppConditionContainerEngineReady)
		alreadyClean := !watcherDied && current != nil && current.Reason == terminalReason && current.Status == terminalStatus
		if !alreadyClean {
			if err := r.cleanupMirrorResources(ctx); err != nil {
				log.Error(err, "Failed to clean up mirror resources")
				return ctrl.Result{}, err
			}
		}
		return ctrl.Result{}, r.setEngineCondition(ctx, &app, terminalStatus, terminalReason, terminalMessage)
	}

	if !watcherRunning {
		log.Info("App is running, starting Docker watcher")
		if err := r.startWatcherAndSync(ctx); err != nil {

The else branch fires every reconcile while the engine is stopped, so removeDockerContext → clearCurrentDockerContext → updateDockerConfig reads, re-marshals, and rewrites ~/.docker/config.json continuously even in steady state. With the current non-atomic write (I3), this turns a single stale-config-cleanup into ongoing risk. Stable state should be no-I/O.

Fix: move removeDockerContext() inside the !alreadyClean block, or fold it into the same condition that gates cleanupMirrorResources. With I3's "no rewrite when nothing changed" path, the cost goes away automatically.

S2. "default" context is unconditionally overridden without a probe Claude Codex
// just engine-owned mirrors. Coexistence with another writer to these
// kinds requires an engine-owned label and matching filters here and
// in the sync_*.go full-sync prune paths. Finalizer removal retries
// on conflict; per-item errors are collected so one stuck object does
// not block the rest.
func (r *EngineReconciler) deleteAllOfType(ctx context.Context, list client.ObjectList) error {
	if err := r.List(ctx, list, client.InNamespace(r.apiNamespace)); err != nil {
		return err
	}

	items, err := meta.ExtractList(list)
	if err != nil {
		return err
	}

	var errs []error
	for _, item := range items {
		obj := item.(client.Object)
		key := client.ObjectKeyFromObject(obj)
		// meta.ExtractList leaves each item's TypeMeta empty (GVK is

The comment claims "default" and "" both mean "no working context", but default is a real Docker context — on Linux with Docker Engine installed it points to /var/run/docker.sock, which may be running. The code treats it as broken and silently promotes ours. The PR's stated contract ("if the current context is healthy, leave it alone") would be violated on every Linux box with both Docker Engine and Rancher Desktop installed.

Fix: probe default explicitly via dockerclient.DefaultDockerHost (or client.FromEnv) instead of skipping it. The change is small once I2's "pass the full Host through" refactor is in place.

S3. DOCKER_CONFIG environment variable is ignored Claude
type dockerEndpointMeta struct {
	Host          string `json:"Host"`
	SkipTLSVerify bool   `json:"SkipTLSVerify"`
}

func dockerConfigDir() (string, error) {
	home, err := os.UserHomeDir()
	if err != nil {
		return "", err
	}
	return filepath.Join(home, ".docker"), nil
}

// contextMetaPath returns the path to the meta.json file for the named context.
// Docker creates the directory name using sha256(contextName).
func contextMetaPath(name string) (string, error) {
	dir, err := dockerConfigDir()

Docker CLI honours $DOCKER_CONFIG to relocate the config dir. Users who set it (common in dev containers, CI runners, or Nix profiles) will have the context written under ~/.docker while their CLI reads from $DOCKER_CONFIG, so the new context never appears.

 func dockerConfigDir() (string, error) {
+    if dir := os.Getenv("DOCKER_CONFIG"); dir != "" {
+        return dir, nil
+    }
     home, err := os.UserHomeDir()
     if err != nil {
         return "", err
     }
     return filepath.Join(home, ".docker"), nil
 }
S4. Empty docker endpoint is treated as "no current context" Claude
}

// getDockerContextSocket returns the Unix socket path that the named context
// points at. Returns an empty string if the context does not exist or has no
// docker endpoint.
func getDockerContextSocket(name string) (string, error) {
	path, err := contextMetaPath(name)
	if err != nil {
		return "", err
	}
	data, err := os.ReadFile(path)
	if errors.Is(err, os.ErrNotExist) {
		return "", nil
	}
	if err != nil {
		return "", err
	}
	var meta dockerContextMeta
	if err := json.Unmarshal(data, &meta); err != nil {
		return "", err
	}
	host := meta.Endpoints["docker"].Host
	return strings.TrimPrefix(host, "unix://"), nil
}

// getCurrentDockerContext reads the currentContext field from ~/.docker/config.json.
// Returns an empty string if the file does not exist or no context is set.
func getCurrentDockerContext() (string, error) {
	dir, err := dockerConfigDir()

If meta.json exists but has no docker endpoint (a kubernetes-only context, which Docker CLI does write), the map lookup returns a zero dockerEndpointMeta{} with Host="". The caller then takes the currentSocket != "" false branch, treats healthy=false, and overrides a context the user is actively using — just one without a docker endpoint.

Fix shares the I2 refactor: return the raw Host and let the caller distinguish "" → not probeable, leave alone from currentSocket missing → promote ours.

S5. BATS tests modify the developer's real ~/.docker/config.json without HOME isolation Claude Codex Gemini
    docker rm --force test-restart
}

# --- Cleanup on shutdown ---

@test "stopping VM removes all mirror resources" {
    # Make sure we have at least one resource to verify cleanup.
    rdd ctl wait --for=create --namespace="${RDD_NAMESPACE}" \
        ContainerNamespace/moby --timeout=10s

    rdd set running=false

    run -0 rdd ctl get containers --namespace="${RDD_NAMESPACE}" --output=name
    refute_output
    run -0 rdd ctl get images --namespace="${RDD_NAMESPACE}" --output=name
    refute_output
    run -0 rdd ctl get volumes --namespace="${RDD_NAMESPACE}" --output=name
    refute_output
    run -0 rdd ctl get ContainerNamespaces --namespace="${RDD_NAMESPACE}" --output=name
    refute_output
}

@test "rdd set running=false returns promptly when already stopped" {
    # rdd set running=false waits for the App's Running condition to
    # go to False, not for ContainerEngineReady. When the VM is
    # already stopped, the wait must complete immediately rather than
    # hang on the already-False engine condition.
    rdd set --timeout=10s running=false
}

@test "VM start recreates ContainerNamespace/moby after cleanup" {
    # The "stopping VM removes all mirror resources" test above swept
    # ContainerNamespace/moby along with the rest of the mirrors, so
    # restarting the VM must recreate it. This bridges the teardown
    # above to the ContainerNamespace delete test below, which needs
    # the namespace to exist before it can delete it.
    rdd set running=true
    rdd ctl wait --for=create --namespace="${RDD_NAMESPACE}" \
        ContainerNamespace/moby --timeout=10s
}

@test "deleting ContainerNamespace/moby completes without a finalizer hang" {
    # moby ContainerNamespace has no mirror finalizer, so a user delete
    # must return promptly rather than get trapped in Terminating.
    rdd ctl delete ContainerNamespace/moby --namespace="${RDD_NAMESPACE}" --timeout=10s
    run -0 rdd ctl get ContainerNamespaces --namespace="${RDD_NAMESPACE}" --output=name
    refute_output
}

# --- containerd backend ---

@test "containerd backend reports ContainerEngineReady=NotApplicable and skips mirroring" {
    # Stop first so there is no stale True/Connected from moby to
    # satisfy the CLI wait below before the engine reconciler has run.
    rdd set running=false

    # Start with containerd. rdd set waits for ContainerEngineReady=True,
    # which the engine reconciler satisfies immediately with reason
    # NotApplicable because engine mirroring only supports the moby
    # backend.
    rdd set containerEngine.name=containerd running=true

    run -0 rdd ctl get app app \
        -o jsonpath='{.status.conditions[?(@.type=="ContainerEngineReady")].reason}'
    assert_output "NotApplicable"

    # No mirror resources should exist in containerd mode.
    run -0 rdd ctl get containers --namespace="${RDD_NAMESPACE}" --output=name
    refute_output
    run -0 rdd ctl get images --namespace="${RDD_NAMESPACE}" --output=name
    refute_output
    run -0 rdd ctl get volumes --namespace="${RDD_NAMESPACE}" --output=name
    refute_output
    run -0 rdd ctl get ContainerNamespaces --namespace="${RDD_NAMESPACE}" --output=name
    refute_output
}

The tests operate on the developer's actual $HOME/.docker/config.json (no helper overrides HOME). The "restore in teardown" inline block at lines 471–475 runs only when execution reaches it — any assert_output failure above it leaves currentContext cleared. There is no file-level teardown.

Fix options, in order of preference:

  1. Override HOME (or DOCKER_CONFIG, after S3 lands) in local_setup_file to a per-test temp dir; the engine controller picks it up when started by the test harness.
  2. Move the save/restore into a BATS teardown so it runs unconditionally.
S6. $(rdd ...) and $(jq ...) command substitution in BATS Claude Codex Gemini
    refute_output
    run -0 rdd ctl get ContainerNamespaces --namespace="${RDD_NAMESPACE}" --output=name
    refute_output
}

@test "rdd set running=false returns promptly when already stopped" {
    # rdd set running=false waits for the App's Running condition to
    # go to False, not for ContainerEngineReady. When the VM is
    # already stopped, the wait must complete immediately rather than
    # hang on the already-False engine condition.
    rdd set --timeout=10s running=false

Repo convention forbids $(cmd) substitution when the result feeds an assertion — failures inside $(...) produce opaque BATS errors with no output. Use run -0 <cmd>; var=$output. Reviewer @mook-as flagged the same line at discussion_r3096228639.

S7. Typo "currentContextshouldhavebeenremoved" in failure message Claude
		data, err := os.ReadFile(p.configFile)
		assert.NilError(t, err)
		var cfg map[string]any
		assert.NilError(t, json.Unmarshal(data, &cfg))
		_, hasCtx := cfg["currentContext"]
		assert.Assert(t, !hasCtx, "currentContextshouldhavebeenremoved")
		_, hasAuths := cfg["auths"]
		assert.Assert(t, hasAuths, "auths key must be preserved")
	})
}
-assert.Assert(t, !hasCtx, "currentContextshouldhavebeenremoved")
+assert.Assert(t, !hasCtx, "currentContext should have been removed")
S8. No integration test exercises the healthy-context-skip path Claude Codex Gemini

The probe's central responsibility — "if the user's context is healthy, leave it alone" — has no integration test. Existing tests cover only "unset → promote ours" and "stop → remove". A test that stages a healthy alternative context (e.g. a second context name pointing at the same socket, set as current) and asserts currentContext is unchanged after the engine starts would catch I2 regressions and document the policy.

S9. newDockerTestDir returns contextsDir that no test reads Claude

	"gotest.tools/v3/assert"
)

// dockerTestPaths holds the config and contexts directory for a test.
type dockerTestPaths struct {
	configFile  string
	contextsDir string
}

// newDockerTestDir creates a temp ~/.docker layout and returns its paths.
func newDockerTestDir(t *testing.T) dockerTestPaths {
	t.Helper()
	root := t.TempDir()
	dockerDir := filepath.Join(root, ".docker")
	assert.NilError(t, os.MkdirAll(dockerDir, 0o700))
	return dockerTestPaths{
		configFile:  filepath.Join(dockerDir, "config.json"),
		contextsDir: filepath.Join(dockerDir, "contexts"),
	}
}

func Test_updateDockerConfig_set(t *testing.T) {
	t.Parallel()

	t.Run("creates file when absent", func(t *testing.T) {

contextsDir is assigned but never referenced by any test in the file. Drop the field, or convert it to a helper returning the meta-path that removes the t.Setenv("HOME", …) gymnastics in the create/delete tests.


Design Observations

Concerns

Strengths


Testing Assessment

The new unit tests cover the file I/O helpers in isolation (updateDockerConfig, createReplaceDockerContext, getCurrentDockerContext) and the BATS additions exercise the create-on-start and remove-on-stop end-to-end paths. The gaps are concentrated in the probe goroutine and the config-file integration:

  1. Probe-vs-cleanup race (I1) — no test triggers a stop while a probe is in flight; a test that stages a slow probe (e.g. via a context pointing at a socket that hangs) and then sets running=false would reproduce the resurrection bug.
  2. Non-Unix current context (I2) — no test stages Host: "tcp://…" or npipe://… in currentContext. The probe behavior with those schemes is unverified.
  3. Healthy-context-skip (S8) — the central policy is untested.
  4. default context interaction (S2) — no test of a healthy /var/run/docker.sock (the realistic Linux scenario) being preserved.
  5. Crash-during-write (I3) — no test for the partial-write recovery, though once os.Rename-based atomicity is in place this becomes structurally impossible.
  6. Repeated engine restartsmanageDockerContext runs once per start in tests, never twice; the cancel-on-supersede path is unexercised.
  7. Windows (I4) — entirely skipped via skip_on_windows; production code runs unconditionally there.

Documentation Assessment


Commit Structure

Two commits: 001fd95 (code) + ef5114d (tests). The tests verify the code in the preceding commit; squashing into one commit is conventional for this repo and keeps git bisect honest. Not blocking.


Unresolved Feedback

Eight inline review comments from @mook-as remain unresolved on the current diff:


Agent Performance Retro

[Claude] Produced the most thorough, finely-segmented review: separated the unix:// scheme bug from the Windows-specific Host construction (I2 vs I4), the "default" override (S2), the DOCKER_CONFIG env-var miss (S3), the empty-endpoint case (S4), the per-reconcile rewrite (S1), and the unscoped logf.Log logging (I5) — none of which Codex or Gemini surfaced individually. The cancel-race fix it proposed was the most defensive (gen-check + ctx-check + WaitGroup); the others stopped at the ctx.Err() check alone, which closes most but not all of the window. Verdict line was the only one to call out Windows production behavior as a blocker.

[Codex] Concise and correct. Caught the cancel-race and the unix:// bug independently, and explicitly bundled the "default" override into the same finding rather than separating it — a defensible call. Did not surface the non-atomic write, the Windows compile-but-broken issue, or the logging-style deviation. The fix proposals were the smallest possible diffs; useful when the priority is "minimum surface to land the fix."

[Gemini] Caught the same two consensus bugs as the others and uniquely flagged the non-atomic os.WriteFile data-loss path (I3) — the only finding no other agent raised. Severity calibration was hot: marked all three findings as Critical, where the consolidator downgraded to Important (recoverable in all three cases without data permanently lost on the median user system). The atomic-write framing was the most actionable, with a concrete tmp+rename patch.

Claude Opus 4.7Codex GPT 5.4Gemini 3.1 Pro
Duration18m 27s3m 34s3m 02s
Findings5I 9S2I 1S3I
Tool calls47 (Grep 19, Bash 14, Read 9)21 (shell 20, stdin 1)7 (grepsearch 3, runshellcommand 3, readfile 1)
Design observations531
False positives000
Unique insights501
Files reviewed444
Coverage misses000
Totals5I 9S2I 1S3I
Downgraded003 (I→S)
Dropped000

Reconciliation


Review Process Notes

Skill improvements

Repo context updates