Deep Review: 20260421-153207-pr-326

Date	2026-04-21 15:32
Repo	rancher-sandbox/rancher-desktop-daemon
Round	1
Author	@jandubois
PR	#326 — Fix deadlock in vendored nxadm/tail (observed as Windows CI hostagent-watcher hang)
Branch	`fix-tail-windows-deadlock`
Commits	`673e91e` Address review 1 findings for PR #326 + prep tail for spin-out `32015c9` tail/watch: synchronise watcher goroutine cleanup with Stop `a2d5c9f` tail: resolve golangci-lint issues in the absorbed code `92881d9` rdd: switch hostagent watcher to local tail-based events package `e074dd9` tail/watch: fix deadlock in the shared InotifyTracker `35d6f05` tail: rename pkg/util/nxadmtail to pkg/util/tail, merge with CLI wrapper `a9bcf89` nxadmtail: absorb upstream watch/, util/, winfile/ subpackages
Reviewers	Claude Opus 4.7 (effort: xhigh), Codex GPT 5.4 (effort: high), Gemini 3.1 Pro (effort: default)
Verdict	Merge with fixes — one critical (newly-introduced panic race in the split tracker) and two important findings; the restructured tracker itself is sound but the goroutine split created a send-on-closed-channel window and the new Stream/Watch helpers don't close the lifecycle loop.
Wall-clock time	`22 min 47 s`

Executive Summary ¶

This PR vendors nxadm/tail in-tree, splits the shared InotifyTracker.run() into three goroutines to break the Windows deadlock on fsnotify.Errors, adds a sync.WaitGroup so tail.Stop() waits for per-watcher ChangeEvents goroutines to finish their untrack RPCs, and introduces pkg/hostagent/events as a local replacement for Lima's unpatched events.Watch. The structural fix is the right shape and the stress test coverage is thorough (200-cycle rotation, GOMAXPROCS=1, external writer subprocess to hit the fsnotify buffer-overflow path). The commit split is clean and bisectable.

However, the goroutine split introduced a new concurrency hazard: InotifyTracker.sendEvent now runs in its own drainer goroutine while removeWatch runs in the RPC goroutine, and their interaction can leave sendEvent in a select where both ch <- event (on a closed channel) and <-done (on a closed channel) are ready cases, giving a 50% chance of a panic that crashes the daemon. In addition, the new Stream helper returns as soon as t.Lines closes, which is before tail.watchers.Wait() completes — defeating the synchronization contract commit 6 added. And both Stream and pkg/hostagent/events.Watch swallow fatal tail.Err() on close, so a doomed tail loop loses its diagnostic on every iteration.

Critical Issues ¶

C1. sendEvent can panic with send on closed channel after removeWatch Gemini¶

func (t *InotifyTracker) sendEvent(event fsnotify.Event) {
	name := filepath.Clean(event.Name)

	t.mux.Lock()
	ch := t.chans[name]
	done := t.done[name]
	t.mux.Unlock()

	if ch != nil && done != nil {
		select {
		case ch <- event:
		case <-done:
		}
	}
}

Splitting run() into three goroutines means sendEvent (events drainer) and removeWatch (RPC handler) now run concurrently. The remove helper at inotify_tracker.go:111-126 closes done first (under shared.mux), then sends on the unbuffered shared.remove; run() receives and calls removeWatch at line 174, which closes ch.

// inotify_tracker.go:115-124 — remove()
shared.mux.Lock()
done := shared.done[winfo.fname]
if done != nil {
    delete(shared.done, winfo.fname)
    close(done)
}
shared.mux.Unlock()
shared.remove <- winfo

// inotify_tracker.go:176-181 — removeWatch()
t.mux.Lock()
ch := t.chans[winfo.fname]
if ch != nil {
    delete(t.chans, winfo.fname)
    close(ch)
}

Race: sendEvent unlocks with non-nil ch and done. Before it reaches the select, remove closes done and removeWatch closes ch. On entry to the select, both cases are ready (a send on a closed channel is selectable and panics; a receive on a closed channel is selectable and returns the zero value). Go picks pseudo-randomly, so ~50% of the time this path is taken it panics with send on closed channel, crashing the daemon. In the pre-PR upstream code sendEvent and removeWatch ran on the same goroutine inside a single select, so the race did not exist.

The window is small (between t.mux.Unlock() and entry to the select) but non-zero. Under the rotation + event storm pattern this PR's own stress test produces, the collision can happen; it would read as a flaky crash rather than a deadlock, so the stress test is unlikely to catch it.

Fix: avoid the send-on-closed case by making the send conditional on done not being closed, e.g. gate the send with a non-blocking poll of done first, or protect the send with defer func() { _ = recover() }() on the understanding that dropping an event for a channel that is being removed is correct. The cleanest structural fix is to have removeWatch not close ch (just delete it from the map) and rely on t.Dying() from the caller's tomb to terminate the per-file goroutine in inotify.go; that is larger than the minimum change, so the targeted recover is a pragmatic stopgap while the structural redesign is planned.

Important Issues ¶

I1. Stream returns before tail.watchers.Wait() completes, bypassing the commit-6 synchronization Gemini¶

var writeErr error
for line := range t.Lines {
    if writeErr != nil {
        continue // drain Lines so tailFileSync can exit cleanly
    }
    if _, err := fmt.Fprintln(writer, line.Text); err != nil {
        writeErr = err
        t.Kill(err) // non-blocking; keep draining Lines so tailFileSync can finish
    }
}
return writeErr

Commit 32015c9 added Tail.watchers sync.WaitGroup specifically so tail.Stop() / tail.Wait() does not return until the per-watcher ChangeEvents goroutine finishes its untrack RPC — the whole point being that a caller immediately re-opening the same filename must not race with a still-running untrack on the shared tracker. The defer order in tailFileSync (tail.go:269-272) is LIFO:

defer tail.Done()
defer tail.watchers.Wait()
defer tail.close()

So tail.close() runs first (closing Lines), then watchers.Wait() blocks until the ChangeEvents goroutine exits. Stream's for line := range t.Lines unblocks as soon as Lines is closed and returns without waiting for the rest of the defer chain. A future caller that treats Stream's return as "the tail is fully torn down" and immediately re-opens the same file reproduces exactly the race commit 6 set out to fix.

The current Stream callers (cmd/rdd/service.go and cmd/rdd/limavm.go) are one-shot CLI commands that exit the process after Stream returns, so this is latent today. It becomes a correctness trap the moment any long-running path calls Stream twice on the same filename, which is a realistic pattern for a log-streaming helper.

Fix: after the for loop, call _ = t.Wait() (or _ = t.Stop() if the loop exited because the caller cancelled) before returning. The t.Wait() form is safe because tomb is already Dying or Dead by the time Lines closes, so Wait returns once the full defer chain (including watchers.Wait) has run.

 	for line := range t.Lines {
 		// ...
 	}
+	_ = t.Wait()
 	return writeErr

I2. pkg/hostagent/events.Watch and tail.Stream silently swallow fatal tail.Err() on close Gemini¶

case line := <-haStdoutTail.Lines:
    if line == nil {
        return nil
    }

When tailFileSync hits a non-EOF read error (tail.go:371: _ = tail.Killf("Error reading %s: %s", tail.Filename, err)), a seek error, or an unrecoverable reopen failure, it calls Kill / Killf on the tomb and then lets the defer chain close Lines. Both events.Watch and Stream treat "Lines closed" as clean shutdown and return nil, discarding tail.Err().

In the controller path (hostagent_watcher.go:130-134), runWatcher only logs err when Watch returns non-nil. If Watch returns nil after a tail failure, the controller sees no diagnostic and proceeds to set phaseStopped and enqueue a reconcile — the hostagent will then be restarted, and if the new tail hits the same failure mode (e.g. permission denied on the log file, inotify instance exhaustion) the cycle repeats silently. There is no evidence in logs that the tail itself failed.

Fix: propagate tail.Err() back to the caller.

 	case line := <-haStdoutTail.Lines:
 		if line == nil {
-			return nil
+			return haStdoutTail.Err()
 		}

Apply the same change to the haStderrTail.Lines case at events.go:129-131 and to the end of Stream:

-	return writeErr
+	if writeErr != nil {
+		return writeErr
+	}
+	return t.Err()

tail.Err() returns nil (or tomb.ErrStillAlive if called before the tomb dies, but here the range loop exited so the tomb is dead) for clean shutdowns, so the propagation is transparent when nothing went wrong.

Suggestions ¶

S1. Duplicate doc comment block on writerEnvVar Claude¶

// writerEnvVar makes the test binary run as a subprocess that writes
// JSON event lines to its stdout until stdin is closed. We self-exec
// (see TestMain) because an external writer process produces the
// ReadDirectoryChanges traffic volume needed to trigger fsnotify's
// "buffer larger than it is" error path on Windows; an in-process
// goroutine writer does not.
// writerEnvVar makes the test binary run as a subprocess that writes
// JSON event lines to its stdout until stdin is closed. We self-exec
// (see TestMain) because an external writer process produces the
// ReadDirectoryChanges traffic volume needed to trigger fsnotify's
// "buffer larger than it is" error path on Windows; an in-process
// goroutine writer does not.
const writerEnvVar = "TAIL_TEST_WRITER"

Lines 25-30 and 31-36 are byte-identical — an editing accident during the review-1 commit (673e91e). Delete the duplicate block.

S2. FileWatcher.ChangeEvents doc claims the goroutine returns after truncation, but neither implementation does Claude¶

// ChangeEvents reports on changes to a file, be it modification,
// deletion, renames or truncations. After a deletion or truncation
// event the implementation's goroutine returns and no further
// notifications arrive on the returned FileChanges; the channels
// themselves stay open.

The inotify.go:136-140 truncation branch calls NotifyTruncated and falls through to the enclosing for loop:

if prevSize > 0 && prevSize > fw.Size {
    changes.NotifyTruncated()
} else {
    changes.NotifyModified()
}

polling.go:106-109 does the same with continue. tail.waitForChanges (tail.go:418-426) depends on this: after truncation it calls reopen() without clearing tail.changes, then the next iteration re-enters waitForChanges expecting the same FileChanges object to keep firing. If the goroutine really returned after truncation, tail.changes.Modified would block forever on the next call.

Fix:

 // ChangeEvents reports on changes to a file, be it modification,
-// deletion, renames or truncations. After a deletion or truncation
-// event the implementation's goroutine returns and no further
-// notifications arrive on the returned FileChanges; the channels
-// themselves stay open.
+// deletion, renames or truncations. After a deletion or rename event
+// the implementation's goroutine returns and no further notifications
+// arrive on the returned FileChanges; the channels themselves stay
+// open. Truncation and modification events are delivered indefinitely.

S3. watch package doc still describes itself as part of the vendored nxadm/tail Claude¶

// Package watch provides the FileWatcher abstraction used by the
// vendored nxadm/tail to observe changes to a file.

After a9bcf89 absorbed these subpackages and 35d6f05 moved them out from under pkg/util/nxadmtail, the top-level tail package doc correctly describes the fork status (tail.go:8-18). This sub-package doc was not updated and still names a consumer — "vendored nxadm/tail" — that no longer exists in the tree.

Fix:

-// Package watch provides the FileWatcher abstraction used by the
-// vendored nxadm/tail to observe changes to a file.
+// Package watch provides the FileWatcher abstraction used by
+// pkg/util/tail to observe changes to a file.

S4. runFakeHostagentSubprocess writes from two goroutines into a single non-thread-safe json.Encoder Claude¶

enc := json.NewEncoder(os.Stdout)
_ = enc.Encode(fakeEvent{Time: time.Now(), Status: fakeStatus{SSHLocalPort: 22}})
time.AfterFunc(200*time.Millisecond, func() {
    _ = enc.Encode(fakeEvent{Time: time.Now(), Status: fakeStatus{SSHLocalPort: 22, Running: true}})
})

time.AfterFunc runs its callback on its own goroutine. The tick loop below starts firing ~25 ms after runFakeHostagentSubprocess begins; the AfterFunc goroutine fires at 200 ms and races with the tick goroutine's enc.Encode calls. json.Encoder is explicitly not safe for concurrent writers — two interleaved Encode calls can produce a garbled line. For this test the consumer (runDualTail) only counts lines, so a corrupt line still counts; but the subprocess output is no longer a clean JSON stream, which is both surprising to a future test that tries to decode it and a potential source of flakes for a future strict parser.

Fix: drop time.AfterFunc and emit the Running event once from the tick loop when the appropriate elapsed time has passed, or guard the encoder with a sync.Mutex. The single-goroutine form is cleaner:

enc := json.NewEncoder(os.Stdout)
_ = enc.Encode(fakeEvent{Time: time.Now(), Status: fakeStatus{SSHLocalPort: 22}})
runningSent := false
start := time.Now()
// ...
for {
    select {
    // ...
    case <-tick.C:
        if !runningSent && time.Since(start) > 200*time.Millisecond {
            _ = enc.Encode(fakeEvent{Time: time.Now(), Status: fakeStatus{SSHLocalPort: 22, Running: true}})
            runningSent = true
        }
        for range 10 {
            _ = enc.Encode(fakeEvent{Time: time.Now(), Status: fakeStatus{SSHLocalPort: 22}})
        }
    }
}

S5. context.Background() inside runRotationCycle bypasses the test's own context Claude Codex Gemini¶

tailCtx, tailCancel := context.WithCancel(context.Background())

Three lines above at line 114 the same function already uses t.Context() (exec.CommandContext(t.Context(), ...)), so the convention for the file is the test-scoped context. Using context.Background() here means a test timeout or t.Fatal elsewhere cannot cancel the in-flight tail; the cycle waits out its 10 s cycleStopBudget before giving up, and if the assertion within fails before tailCancel() runs, the goroutine started on the next line can outlive the test.

Fix:

-tailCtx, tailCancel := context.WithCancel(context.Background())
+tailCtx, tailCancel := context.WithCancel(t.Context())

Design Observations ¶

Strengths ¶

Single-watcher-per-process invariant preserved — Splitting run() into three goroutines while keeping one shared fsnotify.Watcher retains Linux's fs.inotify.max_user_instances=128 budget safety, which is an easy win to lose accidentally when "just add more goroutines". (in-scope) Claude Codex
Containment boundary at pkg/hostagent/events — Introducing a local Watch keyed to rdd's subset of the Lima JSON schema removes the last live path through Lima's unpatched events.Watch without forking more of Lima than rdd actually consumes, and keeps the pkg/util/tail library free of rdd-specific logr glue. (in-scope) Claude Codex
Test designed around the actual failure mode — TestInotifyTrackerNoDeadlockOnRepeatedRotation self-execs an external writer subprocess (the only way to hit fsnotify's Windows buffer-overflow path at useful volume) and pins GOMAXPROCS=1 to serialise the scheduler the way a 2-vCPU windows-latest runner does. The PR's own lifecycle race (fixed in commit 6) was discovered by this same test running on Linux. (in-scope) Claude
Root-cause-first documentation in the tracker — The run() doc comment at inotify_tracker.go:224-244 explains the four-job-one-select deadlock cycle, and the watchers sync.WaitGroup comment on Tail documents the lifecycle contract. A future reader understands the why without replaying the debugging session. (in-scope) Claude

Concerns ¶

No test for concurrent tails of the same filename — The sync.WaitGroup fix closes the sequential window (tail1 stops, tail2 opens). It does not address two tails alive simultaneously on the same filename — removeWatch unconditionally deletes the map entry and closes the per-file channel, which would yank the channel out from under a second live tail on the same path. Not triggered by rdd's current callers (one tail per hostagent stdout/stderr path) but undocumented as a precondition. (future) Claude
Stress test trains on a single Goroutine scheduler — GOMAXPROCS=1 reliably reproduces the original deadlock and the Linux lifecycle race, but masks the send-on-closed-channel race from C1 (which requires two CPUs to actually reorder sendEvent.Unlock() against removeWatch.close(ch)). Consider a second variant of the test with default GOMAXPROCS to exercise the multi-CPU race surface. (future)

Testing Assessment ¶

Unit-test coverage for the new code is solid:

pkg/hostagent/events/events_test.go covers stop-on-callback, begin filter, malformed-JSON recovery, ctx cancel, stderr drainage — the five cases called out in commit 673e91e Claude.
TestInotifyTrackerNoDeadlockOnRepeatedRotation covers both the deadlock and the lifecycle race, red-before/green-after verified in the PR description.
Targeted go test ./pkg/util/tail ./pkg/hostagent/events ./pkg/controllers/lima/limavm/controllers passes Codex.

Gaps, ranked:

sendEvent / removeWatch concurrency (covers C1) — No test exercises a rotation+event-storm where the two goroutines race on the same filename with default GOMAXPROCS. This is the race class C1 describes.
Stream same-file re-open (covers I1) — No test opens Stream twice back-to-back on the same filename, which would catch a regression of the commit-6 synchronization contract.
events.Watch fatal-tail-error propagation (covers I2) — events_test.go lacks a case where the underlying tail fails with a fatal error (e.g. passing a directory path, or simulating a mid-stream read error) and asserts the error is returned Gemini.
Controller path through pkg/hostagent/events.Watch — No controller-level test drives hostagent_watcher.go:130 through the new local package and verifies reconcile re-enqueue on the fixed path; today that behavior is validated only by lower-level tests and CI Codex.
BlockUntilExists cancel from CLI — stream_test.go does not cover cancelling while waiting for a missing file to appear, which is the CLI path most likely to exercise BlockUntilExists() after the vendoring change Codex.
Truncation event delivery (covers S2) — Neither the stress test nor events_test.go exercises truncation on the tailed file. A small unit test that shrinks the file and asserts continued event flow would document the real contract and guard future refactors Claude.

Documentation Assessment ¶

Exported symbols in the new packages (Watch, Event, Status, Stream, Open, Config, Tail, etc.) all have godoc comments Claude.
Two comments are stale (S2, S3); one is duplicated (S1).
pkg/util/tail/tail.go:67-85's Config.RateLimiter was removed from the struct in the original vendoring, but the cooloff branch inside tailFileSync:311-328 is still live and references a rate-limit scenario that can never fire. Pre-existing, flagged here as spin-out cleanup rather than a PR blocker Claude.
The NOTICE update and the new package comments adequately explain the vendored-tail move and the local hostagent event wrapper Codex.

Commit Structure ¶

Clean. Each of the seven commits is a self-contained concept:

a9bcf89 absorbs upstream files (mechanical).
35d6f05 renames + merges with the CLI wrapper.
e074dd9 applies the deadlock fix + stress test.
92881d9 cuts over hostagent_watcher to the fork.
a2d5c9f resolves lint debt introduced by the absorption.
32015c9 fixes the second (Linux) lifecycle race discovered by the stress test.
673e91e addresses prior review feedback and MIT-relicensing prep for spin-out.

Commit messages are specific about what changed and why. Bisect across this range would pinpoint a regression in the correct concept.

Acknowledged Limitations ¶

XXX: we must not log from a library. at pkg/util/tail/tail.go:407 — pre-existing from upstream nxadm/tail, inherited as-is.
XXX: report this error back to the user at pkg/util/tail/watch/inotify.go:131 and polling.go:94 — pre-existing upstream panic(...) on unexpected Stat failure; not addressed, not regressed Claude Codex.
XXX: use tomb.Tomb to cleanly manage these goroutines at pkg/util/tail/watch/polling.go:68 — pre-existing Claude.
Lima's events.Watch remains unpatched; rdd bypasses it — explicit in the "Out of scope" section of the PR description and in the new package's doc comment. Consistent with the stated plan to share the fix with Lima later Claude.
"Do NOT call Cleanup" warning on both tail handles at pkg/hostagent/events/events.go:88-99 — deliberate preservation of Lima's long-standing constraint; mirrored on the stderr defer as part of the review-1 commit Claude Codex.

Unresolved Feedback ¶

No unresolved inline review comments on the PR as of this review.

Agent Performance Retro ¶

Claude ¶

Produced the most findings by volume — five suggestions, all valid — covering documentation drift (S2, S3), an editing artifact (S1), a test-helper concurrency bug (S4), and the t.Context() hygiene issue (S5). Anchored the Design Observations and Testing Assessment sections with concrete, evidence-based strengths and coverage gaps. Missed the two concurrency findings that Gemini caught (C1, I1, I2) despite them being the most impactful issues in the review — traced the defer chain in tailFileSync deeply enough to note the sync.WaitGroup contract but did not connect that to the Stream consumer or to the drainer/RPC goroutine split's effect on sendEvent / removeWatch concurrency. Calibration felt right: nothing flagged as Critical or Important that wasn't, and the "no blocking issues" verdict was compatible with a reasonable reading of the code — but the deeper races were genuinely missed, not just softened.

Codex ¶

Single suggestion (S5), which Gemini and Claude both also found. Strong Design Observations and accurate Testing Assessment gaps (3 and 5 in the consolidated list). Did run the targeted test suite and reported it green, which is useful confirmation. Missed every race and synchronization issue — C1, I1, I2 all ungazed. The review reads as a competent "nothing alarming" pass rather than an independent verification of the new concurrency primitives. Low value on a PR whose whole point is concurrency.

Gemini ¶

Landed both concurrency findings (C1: sendEvent panic race, I1: Stream bypasses watchers.Wait) and the error-swallowing gap (I2). All three required tracing defer chains, goroutine boundaries, and channel ownership across three files — exactly the failure mode the rest of the PR tries to prevent, and exactly what a concurrency review should catch. Initial severity on I1 and I2 was Critical in Gemini's own output; consolidation downgraded I1 to Important (current callers don't trigger) and I2 to Important (controller path still enqueues a reconcile), but the findings themselves were on target. Coverage summary was thorough and accurate. Did not run git blame (known quota-driven behavior) so did not distinguish regressions from pre-existing — in this PR it did not matter because the two important findings both originated in this PR's commits. Highest-value reviewer on this pass.

Summary ¶

	Claude Opus 4.7	Codex GPT 5.4	Gemini 3.1 Pro
Duration	15m 39s	8m 38s	6m 09s
Findings	5S	1S	1C 2I 1S
Tool calls	44 (Bash 26, Read 18)	59 (shell 56, stdin 3)	3 (runshellcommand 3)
Design observations	4	2	0
False positives	0	0	0
Unique insights	5	0	3
Files reviewed	24	24	24
Coverage misses	2	2	0
Totals	5S	1S	1C 2I 1S
Downgraded	0	0	2 (I→S)
Dropped	0	0	0

Reconciliation: Gemini raised its C1 (Stream bypasses watchers.Wait) at Critical severity; consolidation downgraded to Important I1 because current callers (CLI one-shot commands) do not trigger the race — the finding is a latent contract violation. Gemini raised its I1 (swallowed tail errors) at Important; retained as Important I2 with the note that the controller still reconciles on phaseStopped, so the impact is diagnostic rather than wedging. Gemini's C2 (sendEvent panic) retained as Critical C1.

Gemini was the clear winner here. Its two concurrency findings were the highest-value output of the review; without them the review would have produced only polish-level suggestions on a PR whose whole purpose is to fix concurrency. Claude provided the polish coverage and a strong structural read. Codex contributed one finding already found by the other two. Running Opus and Gemini as the pair would have covered essentially the same ground at lower cost; Codex's value-add on this PR was negligible.

Review Process Notes ¶

Skill improvements ¶

Prompt agents to enumerate new concurrency edges introduced by a refactor. When a PR restructures goroutines (splits one goroutine into N, adds a sync.WaitGroup between two subsystems, turns a single-goroutine select into multiple), have the reviewer explicitly list (a) which shared state was previously protected by "one goroutine does everything" and is now protected by mutex/channel, (b) which close/delete operations that used to run in the same goroutine as their readers now run concurrently, and (c) which public API returns whose return now precedes a background shutdown that used to complete before return. The two missed concurrency findings on this review (C1, I1) both fall into category (b) and (c); agents that skipped the enumeration missed them.

Repo context updates ¶

None this pass — existing deep-review-context.md guidance was still accurate.