Deep Review: 20260421-153207-pr-326

Date2026-04-21 15:32
Reporancher-sandbox/rancher-desktop-daemon
Round1
Author@jandubois
PR#326 — Fix deadlock in vendored nxadm/tail (observed as Windows CI hostagent-watcher hang)
Branchfix-tail-windows-deadlock
Commits673e91e Address review 1 findings for PR #326 + prep tail for spin-out
32015c9 tail/watch: synchronise watcher goroutine cleanup with Stop
a2d5c9f tail: resolve golangci-lint issues in the absorbed code
92881d9 rdd: switch hostagent watcher to local tail-based events package
e074dd9 tail/watch: fix deadlock in the shared InotifyTracker
35d6f05 tail: rename pkg/util/nxadmtail to pkg/util/tail, merge with CLI wrapper
a9bcf89 nxadmtail: absorb upstream watch/, util/, winfile/ subpackages
ReviewersClaude Opus 4.7 (effort: xhigh), Codex GPT 5.4 (effort: high), Gemini 3.1 Pro (effort: default)
VerdictMerge with fixes — one critical (newly-introduced panic race in the split tracker) and two important findings; the restructured tracker itself is sound but the goroutine split created a send-on-closed-channel window and the new Stream/Watch helpers don't close the lifecycle loop.
Wall-clock time22 min 47 s


Executive Summary

This PR vendors nxadm/tail in-tree, splits the shared InotifyTracker.run() into three goroutines to break the Windows deadlock on fsnotify.Errors, adds a sync.WaitGroup so tail.Stop() waits for per-watcher ChangeEvents goroutines to finish their untrack RPCs, and introduces pkg/hostagent/events as a local replacement for Lima's unpatched events.Watch. The structural fix is the right shape and the stress test coverage is thorough (200-cycle rotation, GOMAXPROCS=1, external writer subprocess to hit the fsnotify buffer-overflow path). The commit split is clean and bisectable.

However, the goroutine split introduced a new concurrency hazard: InotifyTracker.sendEvent now runs in its own drainer goroutine while removeWatch runs in the RPC goroutine, and their interaction can leave sendEvent in a select where both ch <- event (on a closed channel) and <-done (on a closed channel) are ready cases, giving a 50% chance of a panic that crashes the daemon. In addition, the new Stream helper returns as soon as t.Lines closes, which is before tail.watchers.Wait() completes — defeating the synchronization contract commit 6 added. And both Stream and pkg/hostagent/events.Watch swallow fatal tail.Err() on close, so a doomed tail loop loses its diagnostic on every iteration.


Critical Issues

C1. sendEvent can panic with send on closed channel after removeWatch Gemini
func (t *InotifyTracker) sendEvent(event fsnotify.Event) {
	name := filepath.Clean(event.Name)

	t.mux.Lock()
	ch := t.chans[name]
	done := t.done[name]
	t.mux.Unlock()

	if ch != nil && done != nil {
		select {
		case ch <- event:
		case <-done:
		}
	}
}

Splitting run() into three goroutines means sendEvent (events drainer) and removeWatch (RPC handler) now run concurrently. The remove helper at inotify_tracker.go:111-126 closes done first (under shared.mux), then sends on the unbuffered shared.remove; run() receives and calls removeWatch at line 174, which closes ch.

// inotify_tracker.go:115-124 — remove()
shared.mux.Lock()
done := shared.done[winfo.fname]
if done != nil {
    delete(shared.done, winfo.fname)
    close(done)
}
shared.mux.Unlock()
shared.remove <- winfo
// inotify_tracker.go:176-181 — removeWatch()
t.mux.Lock()
ch := t.chans[winfo.fname]
if ch != nil {
    delete(t.chans, winfo.fname)
    close(ch)
}

Race: sendEvent unlocks with non-nil ch and done. Before it reaches the select, remove closes done and removeWatch closes ch. On entry to the select, both cases are ready (a send on a closed channel is selectable and panics; a receive on a closed channel is selectable and returns the zero value). Go picks pseudo-randomly, so ~50% of the time this path is taken it panics with send on closed channel, crashing the daemon. In the pre-PR upstream code sendEvent and removeWatch ran on the same goroutine inside a single select, so the race did not exist.

The window is small (between t.mux.Unlock() and entry to the select) but non-zero. Under the rotation + event storm pattern this PR's own stress test produces, the collision can happen; it would read as a flaky crash rather than a deadlock, so the stress test is unlikely to catch it.

Fix: avoid the send-on-closed case by making the send conditional on done not being closed, e.g. gate the send with a non-blocking poll of done first, or protect the send with defer func() { _ = recover() }() on the understanding that dropping an event for a channel that is being removed is correct. The cleanest structural fix is to have removeWatch not close ch (just delete it from the map) and rely on t.Dying() from the caller's tomb to terminate the per-file goroutine in inotify.go; that is larger than the minimum change, so the targeted recover is a pragmatic stopgap while the structural redesign is planned.


Important Issues

I1. Stream returns before tail.watchers.Wait() completes, bypassing the commit-6 synchronization Gemini
var writeErr error
for line := range t.Lines {
    if writeErr != nil {
        continue // drain Lines so tailFileSync can exit cleanly
    }
    if _, err := fmt.Fprintln(writer, line.Text); err != nil {
        writeErr = err
        t.Kill(err) // non-blocking; keep draining Lines so tailFileSync can finish
    }
}
return writeErr

Commit 32015c9 added Tail.watchers sync.WaitGroup specifically so tail.Stop() / tail.Wait() does not return until the per-watcher ChangeEvents goroutine finishes its untrack RPC — the whole point being that a caller immediately re-opening the same filename must not race with a still-running untrack on the shared tracker. The defer order in tailFileSync (tail.go:269-272) is LIFO:

defer tail.Done()
defer tail.watchers.Wait()
defer tail.close()

So tail.close() runs first (closing Lines), then watchers.Wait() blocks until the ChangeEvents goroutine exits. Stream's for line := range t.Lines unblocks as soon as Lines is closed and returns without waiting for the rest of the defer chain. A future caller that treats Stream's return as "the tail is fully torn down" and immediately re-opens the same file reproduces exactly the race commit 6 set out to fix.

The current Stream callers (cmd/rdd/service.go and cmd/rdd/limavm.go) are one-shot CLI commands that exit the process after Stream returns, so this is latent today. It becomes a correctness trap the moment any long-running path calls Stream twice on the same filename, which is a realistic pattern for a log-streaming helper.

Fix: after the for loop, call _ = t.Wait() (or _ = t.Stop() if the loop exited because the caller cancelled) before returning. The t.Wait() form is safe because tomb is already Dying or Dead by the time Lines closes, so Wait returns once the full defer chain (including watchers.Wait) has run.

 	for line := range t.Lines {
 		// ...
 	}
+	_ = t.Wait()
 	return writeErr
I2. pkg/hostagent/events.Watch and tail.Stream silently swallow fatal tail.Err() on close Gemini
case line := <-haStdoutTail.Lines:
    if line == nil {
        return nil
    }

When tailFileSync hits a non-EOF read error (tail.go:371: _ = tail.Killf("Error reading %s: %s", tail.Filename, err)), a seek error, or an unrecoverable reopen failure, it calls Kill / Killf on the tomb and then lets the defer chain close Lines. Both events.Watch and Stream treat "Lines closed" as clean shutdown and return nil, discarding tail.Err().

In the controller path (hostagent_watcher.go:130-134), runWatcher only logs err when Watch returns non-nil. If Watch returns nil after a tail failure, the controller sees no diagnostic and proceeds to set phaseStopped and enqueue a reconcile — the hostagent will then be restarted, and if the new tail hits the same failure mode (e.g. permission denied on the log file, inotify instance exhaustion) the cycle repeats silently. There is no evidence in logs that the tail itself failed.

Fix: propagate tail.Err() back to the caller.

 	case line := <-haStdoutTail.Lines:
 		if line == nil {
-			return nil
+			return haStdoutTail.Err()
 		}

Apply the same change to the haStderrTail.Lines case at events.go:129-131 and to the end of Stream:

-	return writeErr
+	if writeErr != nil {
+		return writeErr
+	}
+	return t.Err()

tail.Err() returns nil (or tomb.ErrStillAlive if called before the tomb dies, but here the range loop exited so the tomb is dead) for clean shutdowns, so the propagation is transparent when nothing went wrong.


Suggestions

S1. Duplicate doc comment block on writerEnvVar Claude
// writerEnvVar makes the test binary run as a subprocess that writes
// JSON event lines to its stdout until stdin is closed. We self-exec
// (see TestMain) because an external writer process produces the
// ReadDirectoryChanges traffic volume needed to trigger fsnotify's
// "buffer larger than it is" error path on Windows; an in-process
// goroutine writer does not.
// writerEnvVar makes the test binary run as a subprocess that writes
// JSON event lines to its stdout until stdin is closed. We self-exec
// (see TestMain) because an external writer process produces the
// ReadDirectoryChanges traffic volume needed to trigger fsnotify's
// "buffer larger than it is" error path on Windows; an in-process
// goroutine writer does not.
const writerEnvVar = "TAIL_TEST_WRITER"

Lines 25-30 and 31-36 are byte-identical — an editing accident during the review-1 commit (673e91e). Delete the duplicate block.

S2. FileWatcher.ChangeEvents doc claims the goroutine returns after truncation, but neither implementation does Claude
// ChangeEvents reports on changes to a file, be it modification,
// deletion, renames or truncations. After a deletion or truncation
// event the implementation's goroutine returns and no further
// notifications arrive on the returned FileChanges; the channels
// themselves stay open.

The inotify.go:136-140 truncation branch calls NotifyTruncated and falls through to the enclosing for loop:

if prevSize > 0 && prevSize > fw.Size {
    changes.NotifyTruncated()
} else {
    changes.NotifyModified()
}

polling.go:106-109 does the same with continue. tail.waitForChanges (tail.go:418-426) depends on this: after truncation it calls reopen() without clearing tail.changes, then the next iteration re-enters waitForChanges expecting the same FileChanges object to keep firing. If the goroutine really returned after truncation, tail.changes.Modified would block forever on the next call.

Fix:

 // ChangeEvents reports on changes to a file, be it modification,
-// deletion, renames or truncations. After a deletion or truncation
-// event the implementation's goroutine returns and no further
-// notifications arrive on the returned FileChanges; the channels
-// themselves stay open.
+// deletion, renames or truncations. After a deletion or rename event
+// the implementation's goroutine returns and no further notifications
+// arrive on the returned FileChanges; the channels themselves stay
+// open. Truncation and modification events are delivered indefinitely.
S3. watch package doc still describes itself as part of the vendored nxadm/tail Claude
// Package watch provides the FileWatcher abstraction used by the
// vendored nxadm/tail to observe changes to a file.

After a9bcf89 absorbed these subpackages and 35d6f05 moved them out from under pkg/util/nxadmtail, the top-level tail package doc correctly describes the fork status (tail.go:8-18). This sub-package doc was not updated and still names a consumer — "vendored nxadm/tail" — that no longer exists in the tree.

Fix:

-// Package watch provides the FileWatcher abstraction used by the
-// vendored nxadm/tail to observe changes to a file.
+// Package watch provides the FileWatcher abstraction used by
+// pkg/util/tail to observe changes to a file.
S4. runFakeHostagentSubprocess writes from two goroutines into a single non-thread-safe json.Encoder Claude
enc := json.NewEncoder(os.Stdout)
_ = enc.Encode(fakeEvent{Time: time.Now(), Status: fakeStatus{SSHLocalPort: 22}})
time.AfterFunc(200*time.Millisecond, func() {
    _ = enc.Encode(fakeEvent{Time: time.Now(), Status: fakeStatus{SSHLocalPort: 22, Running: true}})
})

time.AfterFunc runs its callback on its own goroutine. The tick loop below starts firing ~25 ms after runFakeHostagentSubprocess begins; the AfterFunc goroutine fires at 200 ms and races with the tick goroutine's enc.Encode calls. json.Encoder is explicitly not safe for concurrent writers — two interleaved Encode calls can produce a garbled line. For this test the consumer (runDualTail) only counts lines, so a corrupt line still counts; but the subprocess output is no longer a clean JSON stream, which is both surprising to a future test that tries to decode it and a potential source of flakes for a future strict parser.

Fix: drop time.AfterFunc and emit the Running event once from the tick loop when the appropriate elapsed time has passed, or guard the encoder with a sync.Mutex. The single-goroutine form is cleaner:

enc := json.NewEncoder(os.Stdout)
_ = enc.Encode(fakeEvent{Time: time.Now(), Status: fakeStatus{SSHLocalPort: 22}})
runningSent := false
start := time.Now()
// ...
for {
    select {
    // ...
    case <-tick.C:
        if !runningSent && time.Since(start) > 200*time.Millisecond {
            _ = enc.Encode(fakeEvent{Time: time.Now(), Status: fakeStatus{SSHLocalPort: 22, Running: true}})
            runningSent = true
        }
        for range 10 {
            _ = enc.Encode(fakeEvent{Time: time.Now(), Status: fakeStatus{SSHLocalPort: 22}})
        }
    }
}
S5. context.Background() inside runRotationCycle bypasses the test's own context Claude Codex Gemini
tailCtx, tailCancel := context.WithCancel(context.Background())

Three lines above at line 114 the same function already uses t.Context() (exec.CommandContext(t.Context(), ...)), so the convention for the file is the test-scoped context. Using context.Background() here means a test timeout or t.Fatal elsewhere cannot cancel the in-flight tail; the cycle waits out its 10 s cycleStopBudget before giving up, and if the assertion within fails before tailCancel() runs, the goroutine started on the next line can outlive the test.

Fix:

-tailCtx, tailCancel := context.WithCancel(context.Background())
+tailCtx, tailCancel := context.WithCancel(t.Context())

Design Observations

Strengths

Concerns


Testing Assessment

Unit-test coverage for the new code is solid:

Gaps, ranked:

  1. sendEvent / removeWatch concurrency (covers C1) — No test exercises a rotation+event-storm where the two goroutines race on the same filename with default GOMAXPROCS. This is the race class C1 describes.
  2. Stream same-file re-open (covers I1) — No test opens Stream twice back-to-back on the same filename, which would catch a regression of the commit-6 synchronization contract.
  3. events.Watch fatal-tail-error propagation (covers I2)events_test.go lacks a case where the underlying tail fails with a fatal error (e.g. passing a directory path, or simulating a mid-stream read error) and asserts the error is returned Gemini.
  4. Controller path through pkg/hostagent/events.Watch — No controller-level test drives hostagent_watcher.go:130 through the new local package and verifies reconcile re-enqueue on the fixed path; today that behavior is validated only by lower-level tests and CI Codex.
  5. BlockUntilExists cancel from CLIstream_test.go does not cover cancelling while waiting for a missing file to appear, which is the CLI path most likely to exercise BlockUntilExists() after the vendoring change Codex.
  6. Truncation event delivery (covers S2) — Neither the stress test nor events_test.go exercises truncation on the tailed file. A small unit test that shrinks the file and asserts continued event flow would document the real contract and guard future refactors Claude.

Documentation Assessment


Commit Structure

Clean. Each of the seven commits is a self-contained concept:

Commit messages are specific about what changed and why. Bisect across this range would pinpoint a regression in the correct concept.


Acknowledged Limitations


Unresolved Feedback

No unresolved inline review comments on the PR as of this review.


Agent Performance Retro

Claude

Produced the most findings by volume — five suggestions, all valid — covering documentation drift (S2, S3), an editing artifact (S1), a test-helper concurrency bug (S4), and the t.Context() hygiene issue (S5). Anchored the Design Observations and Testing Assessment sections with concrete, evidence-based strengths and coverage gaps. Missed the two concurrency findings that Gemini caught (C1, I1, I2) despite them being the most impactful issues in the review — traced the defer chain in tailFileSync deeply enough to note the sync.WaitGroup contract but did not connect that to the Stream consumer or to the drainer/RPC goroutine split's effect on sendEvent / removeWatch concurrency. Calibration felt right: nothing flagged as Critical or Important that wasn't, and the "no blocking issues" verdict was compatible with a reasonable reading of the code — but the deeper races were genuinely missed, not just softened.

Codex

Single suggestion (S5), which Gemini and Claude both also found. Strong Design Observations and accurate Testing Assessment gaps (3 and 5 in the consolidated list). Did run the targeted test suite and reported it green, which is useful confirmation. Missed every race and synchronization issue — C1, I1, I2 all ungazed. The review reads as a competent "nothing alarming" pass rather than an independent verification of the new concurrency primitives. Low value on a PR whose whole point is concurrency.

Gemini

Landed both concurrency findings (C1: sendEvent panic race, I1: Stream bypasses watchers.Wait) and the error-swallowing gap (I2). All three required tracing defer chains, goroutine boundaries, and channel ownership across three files — exactly the failure mode the rest of the PR tries to prevent, and exactly what a concurrency review should catch. Initial severity on I1 and I2 was Critical in Gemini's own output; consolidation downgraded I1 to Important (current callers don't trigger) and I2 to Important (controller path still enqueues a reconcile), but the findings themselves were on target. Coverage summary was thorough and accurate. Did not run git blame (known quota-driven behavior) so did not distinguish regressions from pre-existing — in this PR it did not matter because the two important findings both originated in this PR's commits. Highest-value reviewer on this pass.

Summary

Claude Opus 4.7Codex GPT 5.4Gemini 3.1 Pro
Duration15m 39s8m 38s6m 09s
Findings5S1S1C 2I 1S
Tool calls44 (Bash 26, Read 18)59 (shell 56, stdin 3)3 (runshellcommand 3)
Design observations420
False positives000
Unique insights503
Files reviewed242424
Coverage misses220
Totals5S1S1C 2I 1S
Downgraded002 (I→S)
Dropped000

Reconciliation: Gemini raised its C1 (Stream bypasses watchers.Wait) at Critical severity; consolidation downgraded to Important I1 because current callers (CLI one-shot commands) do not trigger the race — the finding is a latent contract violation. Gemini raised its I1 (swallowed tail errors) at Important; retained as Important I2 with the note that the controller still reconciles on phaseStopped, so the impact is diagnostic rather than wedging. Gemini's C2 (sendEvent panic) retained as Critical C1.

Gemini was the clear winner here. Its two concurrency findings were the highest-value output of the review; without them the review would have produced only polish-level suggestions on a PR whose whole purpose is to fix concurrency. Claude provided the polish coverage and a strong structural read. Codex contributed one finding already found by the other two. Running Opus and Gemini as the pair would have covered essentially the same ground at lower cost; Codex's value-add on this PR was negligible.


Review Process Notes

Skill improvements

Repo context updates

None this pass — existing deep-review-context.md guidance was still accurate.