Deep Review: 20260421-225153-pr-336

Date2026-04-21 22:51
Reporancher-sandbox/rancher-desktop-daemon
Round2
Author@jandubois
PR#336 — cli: standardize --wait and --timeout across svc and limavm
Branchwait-timeout-cleanup
Commits8aa55e4 cli: address deep-review findings for #336
99aa4e3 cli: standardize --wait and --timeout across svc and limavm
ReviewersClaude Opus 4.7 (effort: xhigh), Codex GPT 5.4 (effort: high), Gemini 3.1 Pro (effort: default)
VerdictMerge with minor fixes — Round 1's Important findings (I1 delete-under-live-process race, I2 shared default timeout) are correctly resolved in 8aa55e4. One new Important gap remains: ensureServiceRunning timeouts still exit with code 1 instead of 4, so the advertised uniform exit-code-4 contract is not uniform for the service-autostart path invoked from rdd set, rdd limavm *, and rdd kubectl. The rest is documentation polish and still-missing BATS coverage for the new svc * timeout paths.
Wall-clock time2 h 4 min 21 s


Executive Summary

8aa55e4 correctly addresses the Round 1 Important findings: --wait=false is removed from svc delete (the directory-removal-under-live-process race is gone), the single defaultWaitTimeout = 30s is replaced with two named constants (startWaitTimeout = 90s, stopWaitTimeout = limaLongWaitTimeout = 5m) each anchored to a concrete invariant in its doc comment, and the SIGTERM/SIGINT signal-name mismatch in docs/design/cmd_service.md is fixed. docs/design/cmd_lima.md is now consistent with the code, and two new BATS tests cover lima create --start --timeout and lima restart --timeout exit-code-4 paths.

The remaining items split cleanly. One Important gap: ensureServiceRunning (called by rdd set, every rdd limavm * path via getKubeClient, and rdd kubectl) returns raw context.DeadlineExceeded without routing through timeoutError, so a service-start timeout during those commands exits with code 1 instead of the advertised code 4. The rest is documentation polish: the PR description and commit body still describe pre-8aa55e4 defaults, docs/design/api_service.md still says svc stop delivers SIGTERM (the companion doc cmd_service.md was fixed but not this one), and docs/design/cmd_service.md's svc delete section lacks a cross-reference to svc stop for the signal/fallback details. The new svc start/svc stop/svc delete timeout paths still have no BATS coverage — 8aa55e4's new tests cover the lima paths only, so Round 1's I3 is partially addressed but leaves the svc * side of the contract unasserted.

A small Delete TOCTOU (self-heals on retry) and a subtle readiness-contract gap on the documented --wait=falserdd svc start workflow round out the suggestion list.


Critical Issues

None.


Important Issues

I1. ensureServiceRunning timeouts bypass the cliexit.Timeout envelope Codex Gemini
		RunE:  serviceConfigAction,
	}
	return command
}

func ensureServiceRunning(ctx context.Context) error {
	if !service.Exists() {
		if err := service.Create(nil); err != nil {
			return err
		}
	}
	if service.Running() {
		ctx, cancel := context.WithTimeout(ctx, startWaitTimeout)
		defer cancel()
		if err := service.Wait(ctx); err != nil {
			return err
		}
		return waitForDiscoveryConfigMap(ctx)
	}
	return startAndWaitForReady(ctx, nil, startWaitTimeout)
}

// startAndWaitForReady starts the service, waits for the API server and the
// discovery ConfigMap to be ready. After an unclean shutdown the ConfigMap may
// survive with stale data, so the freshness check waits for a ConfigMap whose
// creationTimestamp is at or after the current startup.

ensureServiceRunning now enforces the new startWaitTimeout (90s) on every service-autostart path, but it returns raw context.DeadlineExceeded without routing through timeoutError. Callers — getKubeClient in cmd/rdd/limavm.go:261 (used by every rdd limavm * subcommand), rdd set at cmd/rdd/set.go:110 and :566, rdd kubectl at cmd/rdd/kubectl.go:34, and rdd service config at cmd/rdd/service.go:182 — all propagate the error without wrapping. cmd/rdd/main.go's errors.As(err, &cliexit.Error) check then sees a bare deadline error and exits with code 1, contradicting the PR's stated uniform exit-code-4 contract. The svc start path itself is fine (line 302 wraps startAndWaitForReady directly with timeoutError), but the shared ensureServiceRunning entry point five other commands use is not.

}

// getKubeClient returns a controller-runtime client configured for the RDD control plane.
// The returned client supports Watch for event-driven waiting on resource changes.
func getKubeClient(ctx context.Context) (client.WithWatch, error) {
	if err := ensureServiceRunning(ctx); err != nil {
		return nil, err
	}
	config, err := service.GetKubeRestConfig()
	if err != nil {
		return nil, fmt.Errorf("failed to get kubeconfig: %w", err)
}

// fetchPropertyHelp starts the service if needed, fetches the CRD schema, and
// returns a formatted list of available properties.
func fetchPropertyHelp(ctx context.Context) string {
	if err := ensureServiceRunning(ctx); err != nil {
		return ""
	}
	config, err := service.GetKubeRestConfig()
	if err != nil {
		return ""
	}
	return command
}

func ctlAction(cmd *cobra.Command, args []string) error {
	if err := ensureServiceRunning(cmd.Context()); err != nil {
		return err
	}
	if len(args) > 0 && args[0] == "wait-condition" {
		return ctlWaitConditionAction(cmd, args[1:])
	}
		return cm.Annotations[controllers.ReadyAnnotation] == "true", nil
	})
}

func serviceConfigAction(cmd *cobra.Command, _ []string) error {
	if err := ensureServiceRunning(cmd.Context()); err != nil {
		return err
	}
	contents, err := service.GetKubeconfig()
	if err != nil {
		return err
	serveArgs = append(serveArgs, "-v", logrusLevelToKlog())
	serveArgs = append(serveArgs, args...)

	if wait {
		if err := startAndWaitForReady(cmd.Context(), serveArgs, timeout); err != nil {
			return timeoutError(err)
		}
		logrus.Infof("%q control plane is ready", instance.Name())
	} else {
		if err := service.Start(cmd.Context(), serveArgs); err != nil {
			return err

Fix: wrap the error at the source so every caller inherits the contract without a separate timeoutError call at each site:

 func ensureServiceRunning(ctx context.Context) error {
     if !service.Exists() {
         if err := service.Create(nil); err != nil {
             return err
         }
     }
     if service.Running() {
         ctx, cancel := context.WithTimeout(ctx, startWaitTimeout)
         defer cancel()
-        if err := service.Wait(ctx); err != nil {
-            return err
-        }
-        return waitForDiscoveryConfigMap(ctx)
+        if err := service.Wait(ctx); err != nil {
+            return timeoutError(err)
+        }
+        return timeoutError(waitForDiscoveryConfigMap(ctx))
     }
-    return startAndWaitForReady(ctx, nil, startWaitTimeout)
+    return timeoutError(startAndWaitForReady(ctx, nil, startWaitTimeout))
 }

(important, gap — consolidated from Codex I1 and Gemini I2; both agents independently traced the unwrapped error path.)


Suggestions

S1. Delete leaves the directory in place when the process exits between outer and inner Running() checks Claude
}

// StopWithWait sends a shutdown signal to the service. When wait is true, it
// blocks until the process exits or timeout elapses; pass timeout 0 to wait
// indefinitely.
func StopWithWait(wait bool, timeout time.Duration) error {
	if !Running() {
		return fmt.Errorf("%q control plane is not running", instance.Name())
	}

	// Clean up discovery configmap while cluster is still accessible
	_ = cleanupDiscoveryConfigMap() // Clean up discovery configmap to prevent stale controller info

Pre-PR, Delete() called _ = Stop() and discarded the error, so a process that exited between any two checks did not block deletion. Post-PR, Delete(timeout) propagates the StopWithWait error verbatim. A narrow window exists between Delete's outer Running() and StopWithWait's inner one in which the serve subprocess can exit (crash, OOM-kill, external kill, concurrent rdd svc stop). When that happens, StopWithWait returns "… control plane is not running", Delete propagates it, and os.RemoveAll(instance.Dir()) never runs. The user sees exit code 1 and a confusing message, even though the desired end state is one os.RemoveAll away. The condition self-heals on retry — the next invocation sees !Running(), skips the stop branch, and proceeds to delete.

Fix: treat "already not running" as success in Delete by re-checking after the stop failure:

     if Running() {
         if err := StopWithWait(true, timeout); err != nil {
-            return err
+            // The process may have exited between our Running check and
+            // StopWithWait's; in that case its work is already done.
+            if Running() {
+                return err
+            }
         }
     }

(suggestion, regression)

S2. Exit-code-4 contract remains asserted only on lima paths; the new svc * timeout paths have no coverage Claude

    # Delete the VM
    run -0 rdd limavm delete "restart-vm"
}

@test "lima start --timeout fails when VM cannot reach desired state" {
    rdd ctl create configmap "timeout-vm" --namespace "${LIMA_TEST_NS}" \
        --from-literal="template=${TEMPLATE}"
    run_e -0 rdd limavm create "timeout-vm" "timeout-vm" --namespace "${LIMA_TEST_NS}"
    assert_created "timeout-vm" "${LIMA_TEST_NS}" "timeout-vm"

    # Non-functional template cannot boot, so --wait --timeout should fail
    # with exit code 4 (cliexit.CodeTimeout) to match `rdd set`.
    run -4 rdd limavm start --wait --timeout=3s "timeout-vm"
    assert_output --partial 'context deadline exceeded'
}

@test "lima create --start --timeout fails when VM cannot boot" {
    rdd ctl create configmap "create-timeout-vm" --namespace "${LIMA_TEST_NS}" \
        --from-literal="template=${TEMPLATE}"

    # Create with --start: the VM never reaches Running=True because the
    # template is non-functional, so --wait --timeout exits with code 4.
    run -4 rdd limavm create "create-timeout-vm" "create-timeout-vm" \
        --namespace "${LIMA_TEST_NS}" --start --wait --timeout=3s
    assert_output --partial 'context deadline exceeded'
}

@test "lima restart --timeout fails when VM cannot restart" {
    rdd ctl create configmap "restart-timeout-vm" --namespace "${LIMA_TEST_NS}" \
        --from-literal="template=${TEMPLATE}"
    run_e -0 rdd limavm create "restart-timeout-vm" "restart-timeout-vm" --namespace "${LIMA_TEST_NS}"
    assert_created "restart-timeout-vm" "${LIMA_TEST_NS}" "restart-timeout-vm"

    # Restart waits for status.restartCount to increment, which requires the
    # VM to come back up. The non-functional template never boots, so the
    # counter stays at zero and --wait --timeout exits with code 4.
    run -4 rdd limavm restart --wait --timeout=3s "restart-timeout-vm"
    assert_output --partial 'context deadline exceeded'
}

@test "lima help text is displayed" {
    # lima is an alias of the limavm command
    run -0 rdd lima --help
    assert_output --partial "LimaVM virtual machines"

8aa55e4 adds two new exit-code-4 tests on top of the one 99aa4e3 updated, all inside lima subcommands. None of the four new svc * timeout paths the PR introduces carry coverage:

			return nil
		}
		logrus.Infof("waiting for %q control plane to be ready", instance.Name())
		ctx, cancel := watchtools.ContextWithOptionalTimeout(cmd.Context(), timeout)
		defer cancel()
		return timeoutError(service.Wait(ctx))
	}

	// Collect all provided flags as arguments for serve subprocess
	var serveArgs []string
	if cmd.Flags().Changed("controllers") {
	serveArgs = append(serveArgs, "-v", logrusLevelToKlog())
	serveArgs = append(serveArgs, args...)

	if wait {
		if err := startAndWaitForReady(cmd.Context(), serveArgs, timeout); err != nil {
			return timeoutError(err)
		}
		logrus.Infof("%q control plane is ready", instance.Name())
	} else {
		if err := service.Start(cmd.Context(), serveArgs); err != nil {
			return err
	if err != nil {
		return err
	}

	if err := service.StopWithWait(wait, timeout); err != nil {
		return timeoutError(err)
	}
	if wait {
		logrus.Infof("%q control plane has stopped", instance.Name())
	} else {
		logrus.Infof("%q control plane is stopping", instance.Name())
	timeout, err := cmd.Flags().GetDuration("timeout")
	if err != nil {
		return err
	}
	if err := service.Delete(timeout); err != nil {
		return timeoutError(err)
	}
	logrus.Infof("%q control plane has been deleted", instance.Name())
	return nil
}

A regression that returns the wrapped error directly without going through timeoutError will ship green. This restates Round 1 I3; the fixup commit partially addressed it (lima paths now covered) but left the svc side unasserted. The simplest provoke for svc start is creating the control plane with a broken controller manifest; the simplest for svc stop is the setup from bats/tests/33-lima-controllers/limavm-running.bats:327 ("graceful service stop terminates hostagent"). At minimum, add one svc * timeout assertion so the contract is not lima-only.


# Graceful shutdown test
# Verifies that stopping the service terminates all running hostagents.
# The controller's shutdown runnable calls shutdownAllHostagents() on exit.

@test "graceful service stop terminates hostagent" {
    local ha_pid
    ha_pid=$(get_hostagent_pid "${VM_NAME}")
    assert [ -n "${ha_pid}" ]
    assert_process_alive "${ha_pid}"

(suggestion, gap — restates Round 1 I3 partially addressed.)

S3. rdd svc start on an already-running service returns before the ReadyAnnotation is set Claude Codex
		if err := service.Create(args); err != nil {
			return err
		}
		logrus.Infof("successfully created %q control plane", instance.Name())
	}
	if service.Running() {
		logrus.Infof("%q control plane is already running", instance.Name())
		if !wait {
			return nil
		}
		logrus.Infof("waiting for %q control plane to be ready", instance.Name())
		ctx, cancel := watchtools.ContextWithOptionalTimeout(cmd.Context(), timeout)
		defer cancel()
		return timeoutError(service.Wait(ctx))
	}

	// Collect all provided flags as arguments for serve subprocess
	var serveArgs []string
	if cmd.Flags().Changed("controllers") {

docs/design/cmd_service.md:93-95 explicitly advertises the flow "--wait=false … then run rdd service start again with no options to wait." The cold-start path (startAndWaitForReady) waits for both service.Wait(ctx) and waitForFreshDiscoveryConfigMap(ctx, beforeStart) — the latter requires ReadyAnnotation == "true", which the controller manager sets only after every expected controller has registered. The already-running path calls only service.Wait(ctx), which via checkReadiness returns success as soon as the discovery ConfigMap lists registered controllers — before the ready annotation. A concurrent --wait=false followed by rdd svc start during a still-initializing daemon therefore returns success under a narrower contract than the cold-start path guarantees.


Runs `rdd service serve` to actually start the control plane in the background

Additional options:

*   `--wait=false`
    Return immediately, without waiting for all controllers to be ready. Run `rdd service start`
    again with no options to wait.

*   `--timeout=90s`
    How long to wait for the control plane to become ready. Pass `0` to wait indefinitely.
    If the deadline expires, `rdd` exits with code 4.

Codex raised this at Important because the PR's new documentation names the workflow explicitly; Claude raised it at Suggestion because the race window is narrow (controllers finish registering quickly, and a steady-state instance is always fine). We consolidate at Suggestion: the practical exposure is a narrow mid-startup race, but the documented workflow deserves a one-line fix to deliver the advertised guarantee.

Fix: mirror ensureServiceRunning's pattern and wait for the discovery ConfigMap on the already-running branch too:

-    return timeoutError(service.Wait(ctx))
+    if err := service.Wait(ctx); err != nil {
+        return timeoutError(err)
+    }
+    return timeoutError(waitForDiscoveryConfigMap(ctx))

(suggestion, gap — pre-existing; the PR's new docs make it newly relevant.)

S4. BATS tests still pass --wait explicitly on limavm delete, which is now the default Claude
    run -0 rdd limavm stop --wait=false "test-vm"
    assert_output --partial 'stopped'
    assert_running "false"

    # Delete the VM
    run -0 rdd limavm delete --wait "test-vm"
    assert_output --partial 'deleted'
    run -1 rdd ctl get limavm "test-vm" --namespace "${LIMA_TEST_NS}"
    assert_output --partial "not found"
}

limavm delete now defaults to --wait=true (cmd/rdd/limavm.go:253). The explicit flag is redundant and signals to a future reader that the flag is load-bearing — when the opposite is now true. Drop the flag:

		Args:  cobra.ExactArgs(1),
		RunE: func(cmd *cobra.Command, args []string) error {
			return limaVMDeleteAction(cmd.Context(), args[0], wait, timeout)
		},
	}
	command.Flags().BoolVar(&wait, "wait", true, "Wait for the VM to be deleted before returning")
	command.Flags().DurationVar(&timeout, "timeout", limaLongWaitTimeout, "Timeout for --wait (0 means wait indefinitely)")
	return command
}

// getKubeClient returns a controller-runtime client configured for the RDD control plane.
-run -0 rdd limavm delete --wait "test-vm"
+run -0 rdd limavm delete "test-vm"

(suggestion, gap)

S5. PR description and 99aa4e3 commit body describe the pre-8aa55e4 default timeouts Claude

The PR description lists:

> Default timeouts: > - 30s: svc start, svc stop, svc delete, limavm create > - 5m: limavm start, limavm stop, limavm restart, limavm delete

After 8aa55e4:

The "30s bucket" no longer exists. Update the PR description, and either amend 99aa4e3 or extend 8aa55e4's body with the final contract so a future git log reader sees the shipped defaults. This also picks up Round 1 S7 (the service.Stop() / service.WaitWithTimeout() removals are still unnamed in either commit body) and Round 1 S8 (the stale limavm create 30s line).

(suggestion, gap — re-raises Round 1 S7 and S8 unaddressed.)

S6. docs/design/cmd_service.md does not cross-reference svc delete back to svc stop Claude
`--wait=false` to return immediately, or `--timeout=0` to wait indefinitely. If the
deadline expires, `rdd` exits with code 4. The default matches the per-VM ceiling
because graceful shutdown will eventually stop every running LimaVM before the
service exits.

### `rdd service delete`

Stops the control plane and removes the instance directory, the short directory
(which contains the Lima home), and — unless `RDD_KEEP_LOGS` is set — the log
directory.

Delete always waits for the control plane to exit before removing files, because
removing the instance directory under a live process corrupts it on Windows and
breaks PID-file mutual exclusion on Unix. Use `--timeout` to bound that wait
(default `5m`); `0` waits indefinitely. If the deadline expires, `rdd` exits
with code 4 and the directory is left in place.

### `rdd service reset`

Deletes the datastore, but create a new (empty) one with the same settings.

The 99aa4e3 commit body says the change folds the shutdown-wait semantics of svc delete into a reference to svc stop, but the actual doc carries no cross-reference. A reader must know that svc delete uses svc stop's mechanism to understand which signal is sent (SIGINT / CTRLBREAKEVENT, with Kill fallback). A one-line "See svc stop for the signal/fallback details" closes the gap without duplicating content.

(suggestion, gap)

S7. docs/design/api_service.md still describes svc stop as delivering SIGTERM Codex

* It starts any additional controllers configured via the `config` ConfigMap.

## Service shutdown

The control plane will be notified to shut down, either by receiving a SIGTERM signal from the user (via `rdd service stop`), or by the OS preparing to shut down the host.

Some controllers need to be notified of pending shutdown so they can gracefully terminate, e.g. shut down virtual machines.

Any controller can request shutdown notification by creating a ConfigMap in the `rdd-system` namespace:

cmd_service.md was updated to say SIGINT (with the Windows CTRL_BREAK_EVENT mapping) and the implementation calls process.Interrupt(pid) first at pkg/service/cmd/service.go:358-362, but the architecture doc api_service.md still names SIGTERM. The two docs now disagree on the shutdown contract — exactly what this PR is standardizing.

	// Kill fallback never fires. On Windows, Interrupt uses
	// GenerateConsoleCtrlEvent, which fails when caller and target lack a
	// shared console; Kill (TerminateProcess) then bypasses graceful shutdown.
	// Hostagents survive in their own process groups and self-heal on the
	// next service start via killOrphanedHostagent.
	if err := process.Interrupt(pid); err != nil {
		if err := process.Kill(pid); err != nil {
			return fmt.Errorf("failed to stop %q control plane with pid %d: %w", instance.Name(), pid, err)
		}
	}

	if wait {
		ctx := context.Background()
		if timeout > 0 {
			var cancel context.CancelFunc

Fix: update docs/design/api_service.md:21 to match: rdd service stop sends SIGINT on Unix or CTRL_BREAK_EVENT on Windows, then falls back to forced termination if graceful shutdown cannot be initiated or does not complete in time.

(suggestion, gap — pre-existing; the PR's companion doc update made the divergence worse.)

S8. Inconsistent alias for the client-go watch tools import Gemini

cmd/rdd/service.go imports k8s.io/client-go/tools/watch as watchtools, while cmd/rdd/limavm.go imports the same package as toolswatch. Standardize on one alias across the CLI to keep a quick text search from missing call sites.

(suggestion, gap)


Design Observations

Concerns

Strengths


Testing Assessment

Untested paths, in priority order:

  1. svc start --wait --timeout=… deadline expiry returns exit code 4. Most novel path in the PR; the --timeout flag on svc start is new. See S2.
  2. svc stop --wait --timeout=… deadline expiry returns exit code 4. StopWithWait was rewritten to use a context-driven loop; no test exercises the timeout-expiry branch. See S2.
  3. svc delete --timeout=… deadline expiry returns exit code 4 and leaves the directory in place. The doc explicitly promises this; no test asserts it. See S2.
  4. --timeout=0 (indefinite wait) succeeds when the desired state is eventually reached. A regression to context.WithTimeout(ctx, 0) (which times out immediately) would ship green. Round 1 testing gap, still unaddressed.
  5. ensureServiceRunning timeout during rdd limavm * / rdd set / rdd kubectl returns exit code 4. Directly ties to I1; the exit-code contract is not exercised for service-autostart paths.
  6. The documented --wait=falserdd svc start readiness workflow. S3's narrow race window and the docs' advertisement of this workflow; no BATS test covers it.
  7. The Round 1 TODO at bats/tests/20-service/2-create.bats:3 (# TODO: test rdd svc start --wait=false) remains unaddressed.

Documentation Assessment


Commit Structure

99aa4e3 and 8aa55e4 split correctly — the first is the feature, the second is the fixup for Round 1 findings. Both commit bodies understate the final contract (S5): 99aa4e3 lists pre-fixup defaults, and neither names the service.Stop() / service.WaitWithTimeout() library-surface removals. Consider squashing before merge, or amending either commit message to match the shipped code.


Acknowledged Limitations


Unresolved Feedback

No prior PR review comments on this PR.


Agent Performance Retro

Claude

Claude produced the most thorough pass again: six suggestions and six design observations, with the deepest technical detail per finding. Its unique material dominated: the Delete TOCTOU around the outer/inner Running() checks (S1), the PID-recycling race that turns process.Interrupt(PIDNotFound) into a process-group self-signal (design concern), the PR-description-now-lists-pre-fixup-defaults observation (S5), and the BATS --wait redundancy on limavm delete (S4). It independently identified the same serviceStartAction readiness weakness as Codex, but at Suggestion severity with a more careful framing of the race window — which matches the practical exposure better than Codex's Important classification. Its verdict ("Merge") was a hair more optimistic than the consolidated "Merge with minor fixes," because Claude did not flag the ensureServiceRunning exit-code gap that both Codex and Gemini caught.

Codex

Codex caught the Important finding the consolidation led with: ensureServiceRunning errors bypass timeoutError on every service-autostart path (rdd set, rdd limavm *, rdd kubectl), leaving five commands outside the advertised uniform exit-code-4 contract. It also uniquely caught the docs/design/api_service.md SIGTERM stale-doc (S7), which Claude and Gemini both missed despite reviewing the SIGINT/SIGTERM area in Round 1. Its call on the already-running-path readiness mismatch (I2 in its own report) was sharp on the documentation angle but over-weighted on severity; we consolidated at Suggestion alongside Claude's matching finding. No false positives; one clean Important and one unique stale-doc insight.

Gemini

Gemini produced its shortest report of this PR yet and delivered a mixed bag. It independently raised the ensureServiceRunning exit-code-4 gap (its I2, consolidated into our I1) — valuable corroboration of Codex. It caught a clean cosmetic suggestion (S8: watchtools vs toolswatch alias mismatch) that neither other agent noticed. But its critical-severity @test header claim was a false positive: Gemini parsed its own reading of the diff as the file contents and hallucinated a corrupted BATS syntax in lines that are in fact valid @test "…" declarations at limavm-cli.bats:208, 219. Its Important finding on StopWithWait ignoring cmd.Context() was technically accurate but practically a no-op today (the CLI has no signal-to-context plumbing, so cmd.Context() is itself context.Background()); we demoted it to a design observation alongside Claude's matching concern. Gemini continues to skip git blame, so regression vs. pre-existing calls come from the other two agents.

Summary

Claude Opus 4.7Codex GPT 5.4Gemini 3.1 Pro
Duration54m 15s6m 47s5m 05s
Findings6S1I 2S1I 1S
Tool calls59 (Bash 35, Grep 12, Read 12)62 (shell 59, stdin 3)18 (grepsearch 10, runshellcommand 5, readfile 3)
Design observations630
False positives001
Unique insights511
Files reviewed666
Coverage misses000
Totals6S1I 2S1I 1S
Downgraded01 (I→S)1 (I→S)
Dropped001

Reconciliation. Three severity changes:

Overall: Codex contributed the single most valuable finding (the ensureServiceRunning exit-code-4 gap) and the SIGTERM stale-doc. Claude carried the long tail of suggestions and unique design observations. Gemini corroborated the Important finding and caught the import-alias cosmetic, but its critical-severity call was spurious; with this PR's size and Round-2 narrow scope, Codex + Claude would have been sufficient.


Review Process Notes

Skill improvements

Repo context updates