# Deep Review: 20260421-131403-pr-326

| | |
|---|---|
| **Date** | 2026-04-21 13:14 |
| **Repo** | [rancher-sandbox/rancher-desktop-daemon](https://github.com/rancher-sandbox/rancher-desktop-daemon) |
| **Round** | 1 (of target) |
| **Author** | [@jandubois](https://github.com/jandubois) |
| **PR** | [#326](https://github.com/rancher-sandbox/rancher-desktop-daemon/pull/326) — Fix deadlock in vendored nxadm/tail (observed as Windows CI hostagent-watcher hang) |
| **Commits** | `32015c9` tail/watch: synchronise watcher goroutine cleanup with Stop<br>`a2d5c9f` tail: resolve golangci-lint issues in the absorbed code<br>`92881d9` rdd: switch hostagent watcher to local tail-based events package<br>`e074dd9` tail/watch: fix deadlock in the shared InotifyTracker<br>`35d6f05` tail: rename pkg/util/nxadmtail to pkg/util/tail, merge with CLI wrapper<br>`a9bcf89` nxadmtail: absorb upstream watch/, util/, winfile/ subpackages |
| **Reviewers** | Claude Opus 4.7 (effort: xhigh), Codex GPT 5.4 (effort: high), Gemini 3.1 Pro (effort: default) |
| **Verdict** | **Merge with fixes** — gate the 200-cycle stress test behind an env flag so `make test` does not inherit a 7+ minute run on every CI lane; the rest are low-severity polish. |
| **Wall-clock time** | `19 min 7 s` |

## Executive Summary

The PR splits `InotifyTracker.run()` into three goroutines so the fsnotify `Events` and `Errors` drainers no longer share a `select` with synchronous `Add`/`Remove` RPCs, breaking the circular wait that hangs Windows CI after a `ReadDirectoryChanges` buffer overflow. A second, independent lifecycle race between `tail.Stop` and the `ChangeEvents` teardown goroutine is fixed by threading a per-`Tail` `sync.WaitGroup` through `FileWatcher.ChangeEvents` so `tailFileSync` waits for every `untrack` to land before `Done` fires. The new `TestInotifyTrackerNoDeadlockOnRepeatedRotation` reliably reproduces both races (subprocess writer, `GOMAXPROCS=1`, 200 rotate/tail/stop cycles). The rdd-side integration is a small `pkg/hostagent/events` package that mirrors the Lima JSON schema and routes through the fixed fork; after that swap, `nxadm/tail` is `// indirect`.

The fix is well-scoped, the commit history reads cleanly (each commit is one concept), and the root cause is correctly identified.

The one finding that should block merge-as-is: the stress test is `testing.Short()`-skippable but `make test` (the repo's normal unit-test entrypoint, also the command the `Unit tests` CI lane runs) does not pass `-short`, so the ~7-minute test now runs on every unit-test run on macos, ubuntu, and windows. Everything else is polish — stale `NOTICE` pointer, a stale doc comment, `events.Watch` using the global `logrus` instead of the controller `logr`, and a handful of carried-over upstream warts.

## Critical Issues

None.

## Important Issues

I1. **Stress test runs on the default unit-test lane on every CI run** — `pkg/util/tail/tail_stress_test.go:65-88`, `Makefile:158-160`, `.github/workflows/lint.yaml:70-85` [Codex GPT 5.4]

```go
func TestInotifyTrackerNoDeadlockOnRepeatedRotation(t *testing.T) {
    if testing.Short() {
        t.Skip("stress test")
    }
    ...
    const (
        cycles          = 200
        writerDuration  = 2 * time.Second
        cycleStopBudget = 10 * time.Second
    )
```

The test opts out only under `testing.Short()`. But `Makefile:159` defines `test: ... go$(EXE) test ./...` with no `-short`, and the `Unit tests` job in `.github/workflows/lint.yaml:85` invokes exactly that target on the `macos-latest`, `ubuntu-latest`, and `windows-latest` matrix legs. The PR description calls the ~7-minute run out as a *separate* long-running check, but nothing in the repo honors that distinction — every future `make test` (local and CI) now inherits 200 rotate/tail/stop cycles plus a subprocess spawn per cycle.

Fix: gate the test behind an explicit env flag so only the dedicated CI step opts in, and add that step to the workflow:

```diff
-    if testing.Short() {
-        t.Skip("stress test")
-    }
+    if testing.Short() || os.Getenv("RDD_TAIL_STRESS") != "1" {
+        t.Skip("set RDD_TAIL_STRESS=1 to run the stress test")
+    }
```

Then add a dedicated step/job: `RDD_TAIL_STRESS=1 go test -run TestInotifyTrackerNoDeadlockOnRepeatedRotation -timeout 1800s ./pkg/util/tail`.

Claude and Gemini both treated "`testing.Short()` skip" as sufficient; Codex traced the CI command line and correctly identified that it is not.

## Suggestions

S1. **`events.Watch` uses the global `logrus` instead of the controller `logr`** — `pkg/hostagent/events/events.go:68-129` [Codex GPT 5.4, Gemini 3.1 Pro]

```go
cfg := tail.Config{
    Follow:    true,
    ReOpen:    true,
    MustExist: false,
    Logger:    logrus.StandardLogger(),
}
...
if line.Err != nil {
    logrus.WithError(line.Err).Error("hostagent stdout tail error")
    continue
}
```

`runWatcher` builds a contextual logger at `hostagent_watcher.go:72` (`log.FromContext(ctx).WithValues("instance", name, "component", "watcher")`), and the embedded service routes controller logs through klog. The local `events.Watch` drops that context and emits through global `logrus` instead, so diagnostic lines on the watcher hot path lose their `instance`/`component` fields. The PR claims to *own* this code now (that was the whole point of committing 4), so there is no portability constraint to preserve.

Fix: accept a `logr.Logger` and route `tail.Config.Logger` through a small adapter, or set `tail.Config.Logger = tail.DiscardingLogger` and log errors directly through the passed-in `logr.Logger`.

```diff
-func Watch(ctx context.Context, haStdoutPath, haStderrPath string, begin time.Time, onEvent func(Event) bool) error {
+func Watch(ctx context.Context, logger logr.Logger, haStdoutPath, haStderrPath string, begin time.Time, onEvent func(Event) bool) error {
```

Gemini graded this `important`, Codex `suggestion`. The old path (Lima's `events.Watch`) also routed through `logrus`, so the PR does not introduce a regression — but it passed up the opportunity to fix it while the code was in transit.

S2. **`events.Watch` aborts permanently on a single unparseable stdout line** — `pkg/hostagent/events/events.go:110-112` [Gemini 3.1 Pro]

```go
if err := json.Unmarshal([]byte(line.Text), &ev); err != nil {
    return fmt.Errorf("failed to unmarshal %q as %T: %w", line.Text, ev, err)
}
```

If Lima's hostagent writes one non-JSON line to `ha.stdout.log` (a panic trace, a library preamble, a truncated write on a crash), `Watch` returns. `runWatcher` logs the error and exits; `state.done` closes. The reconciler does not restart the watcher, so the in-memory VM phase mirror stops updating until the next reconcile tears the whole instance down. The old Lima path had the same behavior, so this is not a regression introduced here, but moving the implementation in-tree makes it easy to fix now.

Fix: log and `continue` rather than returning.

```diff
-if err := json.Unmarshal([]byte(line.Text), &ev); err != nil {
-    return fmt.Errorf("failed to unmarshal %q as %T: %w", line.Text, ev, err)
-}
+if err := json.Unmarshal([]byte(line.Text), &ev); err != nil {
+    logger.Error(err, "Ignoring unparseable hostagent stdout line", "line", line.Text)
+    continue
+}
```

S3. **`Stream` swallows writer errors and ignores context on `follow=false`** — `pkg/util/tail/stream.go:30-45` [Gemini 3.1 Pro, Claude Opus 4.7]

```go
if !follow {
    go func() { _ = t.StopAtEOF() }()
}
go func() {
    <-ctx.Done()
    _ = t.Stop()
}()

for line := range t.Lines {
    fmt.Fprintln(writer, line.Text)
}
return nil
```

The pre-PR `tail.File` had the identical body; commit 35d6f05 renamed it and moved it to `stream.go` without changes. Two issues carried over:

1. `fmt.Fprintln` errors are discarded. In a pipe-close case (`rdd svc log | head`), the loop keeps consuming and discarding lines until the CLI's context cancels.
2. When `follow=false` the tomb has already been killed with `errStopAtEOF`; a subsequent `Kill(nil)` from the ctx goroutine is a no-op in `tomb.v1`. For a one-shot `rdd svc log` this is harmless (the CLI exits shortly after), but the ctx goroutine also leaks past the `for ... range t.Lines` return because nothing signals it to exit until ctx cancels.

Fix: check the write error, and give the ctx goroutine an exit signal tied to the main function returning:

```diff
+    tailDone := make(chan struct{})
+    defer close(tailDone)
     go func() {
-        <-ctx.Done()
-        _ = t.Stop()
+        select {
+        case <-ctx.Done():
+            _ = t.Stop()
+        case <-tailDone:
+        }
     }()

-    for line := range t.Lines {
-        fmt.Fprintln(writer, line.Text)
-    }
-    return nil
+    for line := range t.Lines {
+        if _, err := fmt.Fprintln(writer, line.Text); err != nil {
+            _ = t.Stop()
+            return err
+        }
+    }
+    return nil
```

Gemini graded this `important (regression)`, Claude `suggestion (pre-existing)`; verified against the base-branch `tail.File` — the body is byte-for-byte identical except for the `nxadmtail` package qualifier, so this is pre-existing.

S4. **`NOTICE` still points at `pkg/util/nxadmtail/`** — `NOTICE:9-10` [Codex GPT 5.4, Claude Opus 4.7]

```
- pkg/util/nxadmtail/: a vendored copy of github.com/nxadm/tail.
  See pkg/util/nxadmtail/LICENSE for details.
```

Commit 35d6f05 renamed the directory and moved `LICENSE` to `pkg/util/tail/LICENSE`, but `NOTICE` was not touched. The directory the file points at no longer exists. `pkg/util/tail/` also now contains three Apache-2.0 files (`stream.go`, `stream_test.go`, `tail_stress_test.go`) under what `NOTICE` describes as an MIT vendor boundary, so a blanket pointer at the directory is wrong even after the path fix.

Fix:

```diff
-- pkg/util/nxadmtail/: a vendored copy of github.com/nxadm/tail.
-  See pkg/util/nxadmtail/LICENSE for details.
+- pkg/util/tail/: the tail.go, tail_posix.go, tail_windows.go, watch/,
+  and winfile/ files are a vendored copy of github.com/nxadm/tail.
+  See pkg/util/tail/LICENSE for details.
```

S5. **`command.go:170` comment still references the removed `events.Watcher()`** — `pkg/hostagent/command.go:170` [Codex GPT 5.4]

```go
// JSON logs are parsed in pkg/hostagent/events.Watcher()
logrus.SetFormatter(new(logrus.JSONFormatter))
```

The exported symbol in the new `pkg/hostagent/events` package is `Watch`, not `Watcher`. The comment predates the local package but was not updated when commit 92881d9 introduced it with a different name.

Fix: `pkg/hostagent/events.Watch()`.

S6. **`FileWatcher.ChangeEvents` interface doc promises channel closes that the implementations do not perform** — `pkg/util/tail/watch/watch.go:22-25` [Codex GPT 5.4]

```go
// ChangeEvents reports on changes to a file, be it modification,
// deletion, renames or truncations. The returned FileChanges
// channels will be closed, thus become unusable, after a deletion
// or truncation event.
```

Neither `inotify.go:115-141` nor `polling.go:91-127` closes the `Modified`/`Truncated`/`Deleted` channels; they call `NotifyDeleted`/`NotifyTruncated` and return. The goroutine exits and no further notifications arrive, but the channels stay open. Before another consumer depends on the "closed" semantics, either fix the doc or fix the implementations.

Fix: update the comment to describe the actual behavior ("no further notifications arrive after a deletion or truncation event; the goroutine returns").

S7. **Polling watcher sleeps 250 ms before noticing `tomb.Dying`** — `pkg/util/tail/watch/polling.go:79-87` [Gemini 3.1 Pro]

```go
for {
    select {
    case <-t.Dying():
        return
    default:
    }

    time.Sleep(POLL_DURATION)
```

`time.Sleep` cannot be cancelled. `tail.Stop` now correctly synchronises on `watchers.Wait()`, so this sleep directly adds up to `POLL_DURATION` (250 ms today) to every service shutdown or log rotation when the polling backend is in use. Pre-existing upstream behavior, but worth a one-line fix while the code is being touched:

```diff
-    time.Sleep(POLL_DURATION)
+    select {
+    case <-t.Dying():
+        return
+    case <-time.After(POLL_DURATION):
+    }
```

S8. **`POLL_DURATION` ALL_CAPS / mutable-var rationale no longer applies** — `pkg/util/tail/watch/polling.go:33-37,134-136` [Claude Opus 4.7]

```go
//nolint:revive,staticcheck // Keep the upstream name for API compatibility.
var POLL_DURATION time.Duration
```

The nolint comment appeals to an external API that no longer exists: this is a private in-tree fork. Commit a2d5c9f already renamed other upstream identifiers (`Watch`→`track`, `Events`→`eventsFor`); this one was missed.

Fix: `const pollDuration = 250 * time.Millisecond`, drop the `init()`, update the two call sites.

S9. **Asymmetric "Do NOT call Cleanup" comment between the stdout and stderr defers** — `pkg/hostagent/events/events.go:80-92` [Claude Opus 4.7]

```go
defer func() {
    _ = haStdoutTail.Stop()
    // Do NOT call Cleanup; it unregisters the tracker entry in a way
    // that prevents the process from ever tailing the file again.
}()
...
defer func() {
    _ = haStderrTail.Stop()
}()
```

Both defers are structurally identical. The stdout one carries the upstream warning; the stderr one does not. A later editor patching the "missing `Cleanup()`" on the stderr half would silently reintroduce the bug the comment warns about. Upstream Lima has the note on both blocks.

Fix: mirror the comment, or hoist both into a local helper so there is one place to read the rule.

S10. **Stress-test subprocess should use `t.Context()` instead of `context.Background()`** — `pkg/util/tail/tail_stress_test.go:102` [Claude Opus 4.7]

```go
cmd := exec.CommandContext(context.Background(), selfExe, "-test.run=^$")
```

The comment correctly notes that stdin closure reaps the child. But if the test panics between `cmd.Start()` (line 108) and `cmd.Wait()` (line 124), the subprocess outlives the test. `t.Context()` lets `testing`'s cleanup kill the subprocess automatically on a failed/panicked run, which also covers the case where a future refactor reorders cycle teardown.

Fix:

```diff
-    cmd := exec.CommandContext(context.Background(), selfExe, "-test.run=^$")
+    cmd := exec.CommandContext(t.Context(), selfExe, "-test.run=^$")
```

S11. **`events.Watch` has no direct unit coverage** — `pkg/hostagent/events/events.go` [Codex GPT 5.4, Gemini 3.1 Pro]

The new local `events.Watch` is the integration boundary between `runWatcher` and the fixed tail library, but it has no tests of its own — no coverage for `begin`-time filtering, the "stop when `onEvent` returns true" contract, malformed JSON handling (S2), or stdout/stderr teardown ordering. `TestInotifyTrackerNoDeadlockOnRepeatedRotation` exercises the underlying tail, not this adapter. Any future change to the parsing loop would go in blind.

Fix: add a small test that feeds a temp file through `Watch` and verifies each of the above contracts.

## Design Observations

### Concerns

- `(future)` **Per-file event channels are unbuffered; a slow consumer stalls the process-wide events drainer** — [Gemini 3.1 Pro]. `InotifyTracker.sendEvent` (`inotify_tracker.go:208-222`) sends to the per-file channel with a `<-done` escape, but if the tail is alive and its `ChangeEvents` goroutine is slow (e.g. blocked in the `os.Stat` call on a `Write` event), the main `watcher.Events` drainer blocks on that send and delivery stalls for every other tail in the process. Not introduced by this PR; worth considering if the number of concurrent tails grows.

### Strengths

- `(in-scope)` **Three-goroutine split preserves the single-`fsnotify.Watcher` design** — `pkg/util/tail/watch/inotify_tracker.go:245-287` [Codex GPT 5.4, Claude Opus 4.7]. The fix addresses the deadlock without giving back the `fs.inotify.max_user_instances` budget win of the shared tracker.
- `(in-scope)` **`sync.WaitGroup` handoff is a precise fix for the lifecycle race** — `pkg/util/tail/tail.go:100-104,268-271`, `pkg/util/tail/watch/inotify.go:83-94`, `pkg/util/tail/watch/polling.go:60-77` [Codex GPT 5.4, Claude Opus 4.7, Gemini 3.1 Pro]. Defer order (close → watchers.Wait → Done) means `tail.Stop` returns only after every `untrack` has landed, closing the stale-tracker window.
- `(in-scope)` **Clean commit structure with surgical bisect points** — [Claude Opus 4.7]. Commit 1 is a pure code-move, 2 is a pure rename, 3 is the deadlock fix with a reproducer, 4 is the consumer swap, 5 is lint-only cleanup, 6 is the second-race fix that the first fix's stress test uncovered. No fixup commits.
- `(in-scope)` **In-code documentation of the root cause** — `pkg/util/tail/watch/inotify_tracker.go:224-244` [Claude Opus 4.7]. The `run()` doc describes the blocked-channel chain and why the split breaks it; someone revisiting this in two years does not need to spelunk through git history.

## Testing Assessment

`TestInotifyTrackerNoDeadlockOnRepeatedRotation` reproduces the precise CI failure pattern: 200 rotate/tail/stop cycles with an external subprocess writer generating enough `ReadDirectoryChanges` traffic to trigger fsnotify's "buffer larger than it is" sendError path on Windows. The same test under Linux `GOMAXPROCS=1` catches the commit-6 lifecycle race (`received 0 events` after ~130 cycles). It is `testing.Short()`-skippable, but see I1 for the gating problem. The pre-existing `TestStream` covers both follow and non-follow modes.

Gaps, in rough order of risk:

1. **No direct tests for `pkg/hostagent/events.Watch`** (S11) — the new adapter between `runWatcher` and the fixed tail has zero coverage of its own.
2. **No coverage for concurrent `Tail`s on the same filename** — `InotifyTracker` reuses `chans[fname]` across tails, and a stop-then-reopen race would still exist if a second consumer appeared. Not currently exercised by rdd; worth a note.
3. **`fsnotify.NewWatcher()` error path panics from a detached goroutine** — `inotify_tracker.go:246-249`. Behaviorally equivalent to upstream's `util.Fatal`, no test, but hard to inject.
4. **No controller-level test exercising `runWatcher` + `events.Watch` across rotated `ha.stdout.log`/`ha.stderr.log`** — the stress test proves the tail layer, not the adapter the reconciler calls.

## Documentation Assessment

- Package docs on `pkg/util/tail/tail.go:8-17`, `pkg/hostagent/events/events.go:6-10`, `pkg/util/tail/watch/watch.go:8-9` are accurate.
- `NOTICE` points at a removed directory (S4).
- `pkg/hostagent/command.go:170` names a symbol that no longer exists (S5).
- `FileWatcher.ChangeEvents` interface doc promises channel closes that the implementations do not perform (S6).
- `POLL_DURATION`'s nolint comment cites API compatibility that no longer applies (S8).
- `events.Watch` has no docstring about its failure semantics — a one-line note that an unparseable line terminates the watch would document S2's current behavior until it is changed.

## Commit Structure

Clean. Each commit is one concept; commit 6 explicitly credits the stress test from commit 3 for exposing the race it fixes; commit 5 isolates lint-only cleanup from the structural changes it polishes. Bisect would land surgically on each. No fixup commits.

## Acknowledged Limitations

- **Comment: `pkg/hostagent/events/events.go:82-84`** — "Do NOT call `Cleanup`; it unregisters the tracker entry in a way that prevents the process from ever tailing the file again." Correctly describes a pre-existing bug in the vendored `InotifyTracker` where `removeWatch` decrements `watchNums` unconditionally, so a stray `Cleanup()` after `ChangeEvents` has already untracked leaves the counter at −1 and every subsequent `addWatch` increments to 0 without calling `watcher.Add`. Not fixed here; S9 asks only for symmetry between the stdout and stderr defers.
- **Comment: `pkg/util/tail/watch/inotify.go:121`** — `// With an open fd, unlink(fd) - inotify returns IN_ATTRIB (==fsnotify.Chmod)`. Documents an intentional fsnotify quirk handler. Not actionable.
- **Comment: `pkg/util/tail/watch/inotify.go:131-132`** and **`polling.go:97-98`** — `// XXX: report this error back to the user` on the `panic` path after a non-`IsNotExist` stat error. Pre-existing in upstream; unchanged here.
- **Comment: `pkg/util/tail/watch/polling.go:70-71`** — `// XXX: use tomb.Tomb to cleanly manage these goroutines. replace the fatal (below) with tomb's Kill.` Carried from upstream; the commit-6 WaitGroup fix addresses a different concern.
- **Comment: `pkg/controllers/lima/limavm/controllers/hostagent_watcher.go:101-104`** — `ev.Status.Degraded is intentionally not handled here`. Consistent with the upstream event-set subsetting this PR already does in `pkg/hostagent/events`.
- **PR description — "Out of scope: Upstreaming the fix"** — author explicitly defers upstreaming with rationale (nxadm/tail unmaintained since 2019; fsnotify's unbuffered `Errors` channel is the underlying hazard). Accepted.
- **PR description — "Out of scope: Converting Lima's `hostagent/events.Watch`"** — the rdd strategy here is to bypass Lima's events watcher rather than fix it upstream. Accepted.

## Agent Performance Retro

### [Claude]

Claude produced the most file-level polish (the `NOTICE` pointer, the `POLL_DURATION` nolint stale rationale, the asymmetric stop comment, the `t.Context()` suggestion) and wrote the strongest framing of the fix's *shape* — the commit-structure callout and the `run()` root-cause annotation both came from its read. It missed I1 outright, though: it listed the `testing.Short()` skip as a *strength* without tracing what `make test` actually runs, and so inverted the finding. Its severity bar was conservative — no important findings, which lines up with Codex but trails Gemini's two elevated calls.

### [Codex]

Codex owned I1, the only merge-blocker in the review, and followed the chain from `tail_stress_test.go` through `Makefile:159` to the `.github/workflows/lint.yaml` unit-test job — exactly the kind of cross-file verification the other two agents skipped. It also caught the dead `events.Watcher()` comment in `command.go:170` and the `FileWatcher.ChangeEvents` interface-doc drift, both real. Its S2 `NOTICE` suggestion overlapped with Claude's S1 but with the narrower fix; Claude's wording made it into the consolidated report because it also addressed the Apache-2.0 file mix. Its one build-time friction note (`go build ./...` failed on the missing guestagent binary) is noise from a clean worktree without build artifacts, not a PR concern.

### [Gemini]

Gemini raised two findings the other agents missed — the malformed-JSON abort (S2) and the unresponsive `time.Sleep` in the polling watcher (S7), both valid. Its severity calibration ran hot: I1 (Stream write errors) and I2 (logrus) were both tagged `important (regression)`, but verification against the base branch shows they are pre-existing. I downgraded both to suggestions. Its design-concern callout on the process-wide events drainer is a genuinely useful note for future work even if not in scope. `git blame` usage: Gemini did not verify its "regression" tags against the base branch, which is a recurring pattern consistent with its daily quota limits documented in the retro guide.

### Summary

| | Claude Opus 4.7 | Codex GPT 5.4 | Gemini 3.1 Pro |
|---|---|---|---|
| Duration | 13m 13s | 8m 10s | 6m 03s |
| Findings | 5S | 1I 2S | 1I 3S |
| Tool calls | 44 (Read 21, Bash 18, Grep 5) | 61 (shell 59, stdin 2) | 14 (read_file 6, grep_search 4, run_shell_command 4) |
| Design observations | 5 | 3 | 2 |
| False positives | 0 | 0 | 0 |
| Unique insights | 4 | 3 | 2 |
| Files reviewed | 20 | 20 | 20 |
| Coverage misses | 0 | 0 | 0 |
| **Totals** | **5S** | **1I 2S** | **1I 3S** |
| Downgraded | 0 | 0 | 2 (I→S) |
| Dropped | 0 | 0 | 0 |


**Overall**: Codex delivered the single most valuable finding (I1), which changed the verdict. Gemini uncovered the most unique scenarios (two suggestions no other agent flagged). Claude produced the most polish and had the cleanest file-by-file coverage. Each of the three would have left one or more findings on the table alone; together they covered the PR well.

## Review Process Notes

### Skill improvements

- **Verify "regression" tags against the base branch before accepting them.** When a finding describes a behavior as "regression", cross-check that the offending code did not exist (or behaved differently) before the PR. Renames, moves, and refactors often carry pre-existing bugs without introducing them. Why: severity inflation from mislabeling pre-existing issues as regressions shifts the review's focus away from what the PR actually changed.

- **Trace how each test actually runs in CI, not just whether it has an opt-out.** A finding that "the test is `testing.Short()`-skippable" is not the same as "the test does not run by default" — check the `Makefile` target the CI workflow invokes and the flags passed. Why: a test gated on an opt-out that CI does not set adds its full runtime to every CI run across the OS matrix, which is a real merge-blocking issue that reads as "test is well-guarded" on a surface review.

### Repo context updates

None. The existing `deep-review-context.md` entries on embedded-apiserver lifecycle, cross-controller tracing, and repository layout were accurate for this review — no changes under review contradicted them, and no new repo conventions surfaced that would improve future reviews.

## Appendix: Original Reviews

### [Claude Opus 4.7]

# Deep Code Review: PR #326 — Fix deadlock in vendored nxadm/tail

| | |
|---|---|
| **Verdict** | Approve with minor cleanup |
| **Scope** | 6 commits, ~1600 lines moved + ~300 lines of new/modified logic |
| **Risk** | Low. Touches a narrow subsystem already broken on Windows CI; the fix targets the exact deadlock site with a focused structural change and ships a reproducing test. |

## Executive Summary

The PR splits a single goroutine that was servicing synchronous `fsnotify` Add/Remove RPCs alongside the Events/Errors drainers, into three goroutines — breaking the circular wait that Windows CI hit. A second, independent lifecycle race (goroutine exits after `Tomb.Wait` returns) is fixed by adding a `sync.WaitGroup` the tail's `FileWatcher` increments before spawning its goroutine. Both changes are well-scoped, the root cause is correctly identified, and the new `TestInotifyTrackerNoDeadlockOnRepeatedRotation` stress test exercises the exact pattern `LimaVMReconciler` uses. I found one small regression (stale `NOTICE` pointer) and a few low-severity nits.

## Findings

### Critical Issues

None.

### Important Issues

None.

### Suggestions

#### S1. `NOTICE` still references the old `pkg/util/nxadmtail/` path — `NOTICE:9-10` (suggestion, regression)

```
- pkg/util/nxadmtail/: a vendored copy of github.com/nxadm/tail.
  See pkg/util/nxadmtail/LICENSE for details.
```

Commit 35d6f05 renamed `pkg/util/nxadmtail/` to `pkg/util/tail/` and moved `LICENSE` along with it, but `NOTICE` was not updated. The directory the file points to no longer exists, so anyone following this reference to understand the license boundary hits a dead end. `pkg/util/tail/` also now contains *new* files under Apache-2.0 (`stream.go`, `stream_test.go`, `tail_stress_test.go`), so the MIT-vendored subset is narrower than the whole directory — the NOTICE should call out the specific files.

Fix:

```diff
-- pkg/util/nxadmtail/: a vendored copy of github.com/nxadm/tail.
-  See pkg/util/nxadmtail/LICENSE for details.
+- pkg/util/tail/: the tail.go, tail_posix.go, tail_windows.go, watch/,
+  and winfile/ files are a vendored copy of github.com/nxadm/tail.
+  See pkg/util/tail/LICENSE for details.
```

---

#### S2. `POLL_DURATION` "API compatibility" rationale no longer applies — `pkg/util/tail/watch/polling.go:33-37` (suggestion, enhancement)

```go
// POLL_DURATION is the interval at which a PollingFileWatcher re-stats
// the file it watches.
//
//nolint:revive,staticcheck // Keep the upstream name for API compatibility.
var POLL_DURATION time.Duration
```

The nolint comment references preserving an external API, but this is now an in-tree private fork — the name is only referenced by `polling.go` itself (line 48 and 86). The ALL_CAPS name and mutable package-level variable add noise that the linter would otherwise catch. Since commit 5 was an explicit lint-cleanup pass that already renamed other upstream-sourced identifiers (`Watch`→`track`, `Events`→`eventsFor`), this one was missed.

Fix: rename to `pollDuration` as an unexported `const` — no external consumer prevents it:

```diff
-// POLL_DURATION is the interval at which a PollingFileWatcher re-stats
-// the file it watches.
-//
-//nolint:revive,staticcheck // Keep the upstream name for API compatibility.
-var POLL_DURATION time.Duration
+// pollDuration is the interval at which a PollingFileWatcher re-stats
+// the file it watches.
+const pollDuration = 250 * time.Millisecond
```

Then drop the `init()` block at lines 134-136 and update the two call sites at lines 48 and 86.

---

#### S3. Asymmetric stop-comment in `events.Watch` — `pkg/hostagent/events/events.go:80-92` (suggestion, enhancement)

```go
defer func() {
    _ = haStdoutTail.Stop()
    // Do NOT call Cleanup; it unregisters the tracker entry in a way
    // that prevents the process from ever tailing the file again.
}()

haStderrTail, err := tail.Open(haStderrPath, cfg)
if err != nil {
    return err
}
defer func() {
    _ = haStderrTail.Stop()
}()
```

The stdout defer carries the "Do NOT call Cleanup" note; the stderr defer does not. Both are structurally identical — either both need the warning so a later editor doesn't add `Cleanup()` on a hunch, or neither does. Upstream Lima has the identical note on both blocks (`lima v2.1.1 .../hostagent/events/watcher.go:31-47`).

Fix: mirror the comment onto the stderr defer, or hoist both into a tiny `stopTail(t)` helper so there is only one place to read the rule.

---

#### S4. `Stream` leaks a goroutine when `follow=false` completes before `ctx` cancels — `pkg/util/tail/stream.go:36-39` (suggestion, gap)

```go
go func() {
    <-ctx.Done()
    _ = t.Stop()
}()

for line := range t.Lines {
    fmt.Fprintln(writer, line.Text)
}
return nil
```

When `follow=false`, the tail reaches EOF, `t.StopAtEOF()` (spawned at line 32-34) completes, `t.Lines` closes, the range ends, and `Stream` returns. The `<-ctx.Done()` goroutine is still blocked until the caller cancels `ctx`. For a one-shot command like `rdd service log` (no `-f`), that happens when cobra's context ends at process exit — not a leak in practice, but a leak in principle. The behavior predates this PR (it was in the old `pkg/util/tail/tail.go` wrapper under the same logic), so not strictly a regression, but commit 35d6f05 moved the code without fixing it.

Fix: add a tail-exit channel so the ctx watcher can bail:

```diff
+    tailDone := make(chan struct{})
+    defer close(tailDone)
     go func() {
-        <-ctx.Done()
-        _ = t.Stop()
+        select {
+        case <-ctx.Done():
+            _ = t.Stop()
+        case <-tailDone:
+        }
     }()
```

---

#### S5. Stress test uses `context.Background()` where `t.Context()` would tie subprocess cleanup to the test — `pkg/util/tail/tail_stress_test.go:102` (suggestion, enhancement)

```go
cmd := exec.CommandContext(context.Background(), selfExe, "-test.run=^$") // no-op test selector; TestMain branches on env
```

The justifying comment (lines 99-101) is correct — stdin closure reaps the child and the cycle logic force-kills if `Wait` doesn't return. But if the test itself panics between `cmd.Start()` (line 108) and `cmd.Wait()` (line 124), the subprocess outlives the test. `t.Context()` would let `testing`'s cleanup kill the subprocess automatically after a failed test.

Fix:

```diff
-    cmd := exec.CommandContext(context.Background(), selfExe, "-test.run=^$") // no-op test selector; TestMain branches on env
+    cmd := exec.CommandContext(t.Context(), selfExe, "-test.run=^$") // no-op test selector; TestMain branches on env
```

The existing stdin-close and timeout logic is unchanged; this adds a backstop for the panic path.

---

## Design Observations

### Strengths

- The commit split is exemplary: commit 1 is a pure code-move (diff reads 0 logical changes), commit 2 is a pure rename/merge, commit 3 is the deadlock fix isolated from lint noise, commit 4 swaps the consumer, commit 5 does lint-only cleanup, commit 6 fixes the second race that the first fix revealed. Bisect would land surgically on each.
- The `run()` comment at `pkg/util/tail/watch/inotify_tracker.go:224-244` describes the deadlock, the chain of blocked channel operations, and *why the split fixes it*. Anyone revisiting this in two years will understand the reasoning without spelunking through git history.
- Keeping the single shared `fsnotify.Watcher` (as opposed to one per tail) is the right call for `fs.inotify.max_user_instances` budget — the PR description explicitly explains why this constraint motivated the original design.
- The stress test honours `testing.Short()` (line 66-68) so `go test -short ./...` isn't blocked by a 7-minute run, but `go test ./...` in CI catches regressions.

### Concerns

None that warrant redesign. The concurrent-tail-on-same-path race I considered (two live `Tail`s watching the same filename where one's `Stop` interleaves with the other's `track`) is not exercised by rdd — every rdd caller uses distinct paths (`ha.stdout.log` and `ha.stderr.log` per instance, different instances live in different directories). The WaitGroup fix in commit 6 is sufficient for the sequential stop-then-reopen pattern that rdd does use. `(future)`

---

## Testing Assessment

Coverage for the new code is strong. `TestInotifyTrackerNoDeadlockOnRepeatedRotation` reproduces the precise CI failure pattern (200 rotate/tail/stop cycles with an external-process writer to trigger Windows `ReadDirectoryChanges` buffer overflow) and also catches the commit-6 lifecycle race when run under Linux `GOMAXPROCS=1`. The existing `TestStream` covers both follow and non-follow modes.

Untested scenarios, in rough order of risk:

1. Concurrent `Tail` instances watching the same filename (the race I described in Design Observations). Not exercised by rdd but the code path exists — if a future change adds a second consumer, the race re-emerges.
2. `fsnotify.NewWatcher()` returning an error — the code panics at `pkg/util/tail/watch/inotify_tracker.go:247-249` from inside a detached goroutine, crashing the process. Behaviorally equivalent to upstream's `util.Fatal`, but no test verifies the error path (hard to inject).
3. `events.Watch` returning early because `onEvent` returned `true` while the tail goroutine is blocked sending on `Lines` (the pre-existing cooldown-message unconditional send at `pkg/util/tail/tail.go:317`). Inherited from upstream nxadm/tail; rdd doesn't hit the rate-limit path in practice.

---

## Documentation Assessment

- Package doc on `pkg/util/tail/tail.go:8-17` accurately describes the history and current status. Good.
- Package doc on `pkg/hostagent/events/events.go:6-10` correctly names the upstream it replaces and the reason — matches what an implementer would need to know.
- Package doc on `pkg/util/tail/watch/watch.go:8-9` is sparse but accurate.
- `NOTICE` entry is stale — see S1.
- The `POLL_DURATION` nolint comment misrepresents the remaining motivation — see S2.

---

## Commit Structure

Clean. Each commit is a single concept, commit 6 explicitly documents that it fixes a race exposed *by* testing the commit-3 fix, and commit 5 is an isolated lint-only pass rather than being folded into the commits it cleans up (which would conflate structural change with cosmetic change). No fixup commits.

---

## Acknowledged Limitations

- **`pkg/util/tail/watch/polling.go:70-71`** — `// XXX: use tomb.Tomb to cleanly manage these goroutines. replace the fatal (below) with tomb's Kill.` Carried from upstream nxadm/tail; this PR does not change the behavior. The commit-6 WaitGroup fix addresses a different lifecycle concern (untrack ordering), not this one.
- **`pkg/util/tail/watch/inotify.go:131-132`** — `// XXX: report this error back to the user` on the `panic` path after a non-`IsNotExist` stat error. Pre-existing in upstream; unchanged here.
- **`pkg/util/tail/watch/inotify.go:121`** — `// With an open fd, unlink(fd) - inotify returns IN_ATTRIB (==fsnotify.Chmod)`. Documentation of an fsnotify quirk that the code intentionally handles; not actionable.
- **`pkg/hostagent/events/events.go:82-84`** — `// Do NOT call Cleanup; it unregisters the tracker entry in a way that prevents the process from ever tailing the file again.` Justified and carried from upstream Lima; S3 just asks for symmetry between stdout and stderr defers, not removal.
- **`pkg/util/tail/tail.go:313-322`** — the cooldown-message unconditional send to `tail.Lines` predates this PR and is inherited from upstream. Only fires under rate-limiting, which rdd's workload doesn't hit.
- **PR description — "Out of scope: Upstreaming the fix"** — author explicitly defers upstreaming with rationale (nxadm/tail unmaintained since 2019, fsnotify's unbuffered `Errors` channel is the underlying hazard). Accepted.
- **PR description — "Out of scope: Converting Lima's `hostagent/events.Watch`"** — author explicitly notes the rdd approach is to bypass Lima's events watcher rather than fix it upstream. Accepted.

---

## Coverage Summary

| File | Status |
|---|---|
| `Makefile` | Reviewed, no issues (LTAG_EXCLUDES updated to match the new paths; the new `stream.go`/`stream_test.go`/`tail_stress_test.go` are Apache-2.0 and correctly NOT excluded) |
| `cmd/rdd/limavm.go` | Trivial (function rename `tail.File` → `tail.Stream`) |
| `cmd/rdd/service.go` | Trivial (function rename `tail.File` → `tail.Stream`) |
| `go.mod` | Reviewed, no issues (`fsnotify` promoted to direct; `nxadm/tail` demoted to indirect — both correct for the post-refactor import graph) |
| `pkg/controllers/lima/limavm/controllers/hostagent_watcher.go` | Reviewed, no issues (import swap + `propagateStderr` parameter drop; onEvent callsite unchanged) |
| `pkg/hostagent/events/events.go` | S3 |
| `pkg/util/nxadmtail/tail.go` (deleted) | Trivial (moved to `pkg/util/tail/tail.go`) |
| `pkg/util/tail/LICENSE` | Trivial (rename from `pkg/util/nxadmtail/LICENSE`) |
| `pkg/util/tail/stream.go` | S4 |
| `pkg/util/tail/stream_test.go` | Trivial (function rename in test body and `TestFile`→`TestStream`) |
| `pkg/util/tail/tail.go` | Reviewed, no issues (new `watchers sync.WaitGroup` field + `defer tail.watchers.Wait()` + `partitionString` inlined from upstream `util`; `File`→`Open` rename is internally consistent) |
| `pkg/util/tail/tail_posix.go` | Trivial (package rename only) |
| `pkg/util/tail/tail_stress_test.go` | S5 |
| `pkg/util/tail/tail_windows.go` | Trivial (package rename + import path fix) |
| `pkg/util/tail/watch/filechanges.go` | Reviewed, no issues (pure port from upstream) |
| `pkg/util/tail/watch/inotify.go` | Reviewed, no issues (`ChangeEvents` signature now takes `*sync.WaitGroup` and the goroutine correctly `wg.Add(1)` before spawn + `defer wg.Done()`; all four exit paths call `untrack` before return) |
| `pkg/util/tail/watch/inotify_tracker.go` | Reviewed, no issues (the deadlock fix itself; the three-goroutine split is correct, `addWatch`/`removeWatch` continue to serialize through the RPC loop, `chans`/`done` map mutations remain under `mux`) |
| `pkg/util/tail/watch/polling.go` | S2 |
| `pkg/util/tail/watch/watch.go` | Reviewed, no issues (interface updated to carry `*sync.WaitGroup`; doc comment explains why) |
| `pkg/util/tail/winfile/winfile.go` | Reviewed, no issues (pure port from upstream; docstrings added for `Open`/`OpenFile` clarify the Windows share-mode rationale) |
| `NOTICE` (unmodified) | S1 (should be modified but was missed) |
---

### [Codex GPT 5.4]

### Executive Summary

This PR vendors the `nxadm/tail` stack in-tree, fixes the shared `InotifyTracker` deadlock/lifecycle race, and switches the LimaVM hostagent watcher to the local fork. The watcher logic itself looks sound, but the new 200-cycle stress test is now part of the default `make test` path, which turns ordinary unit-test runs into multi-minute stress jobs. I also found two lower-severity follow-ups: the new hostagent watcher path bypasses the service logger, and the vendored-license notice still points at the removed `pkg/util/nxadmtail` path.

### Findings

**Critical Issues**

None.

**Important Issues**

I1. **Stress test now runs in the default unit-test target** — `pkg/util/tail/tail_stress_test.go:65-87` (important, regression)

```go
func TestInotifyTrackerNoDeadlockOnRepeatedRotation(t *testing.T) {
	if testing.Short() {
		t.Skip("stress test")
	}
	...
	const (
		cycles          = 200
		writerDuration  = 2 * time.Second
		cycleStopBudget = 10 * time.Second
	)
```

`TestInotifyTrackerNoDeadlockOnRepeatedRotation` only opts out under `testing.Short()` at `pkg/util/tail/tail_stress_test.go:66-68`, but the repo’s normal unit-test entrypoint still runs plain `go test ./...` at `Makefile:158-159`, and the existing unit-test workflow invokes that target at `.github/workflows/lint.yaml:70-84`. Because the new test hard-codes 200 rotation cycles at `pkg/util/tail/tail_stress_test.go:73-75`, every routine `make test` now inherits the same multi-minute stress workload that the PR description explicitly calls out as a separate long-running check. This will slow local iteration and the default unit-test lane on all three OSes even when developers are not touching the tail watcher.

Fix:

```diff
- if testing.Short() {
-     t.Skip("stress test")
- }
+ if testing.Short() || os.Getenv("RDD_TAIL_STRESS") != "1" {
+     t.Skip("set RDD_TAIL_STRESS=1 to run the stress test")
+ }
```

Then add a dedicated CI step/job for `RDD_TAIL_STRESS=1 go test -run TestInotifyTrackerNoDeadlockOnRepeatedRotation -timeout 1800s ./pkg/util/tail`.

**Suggestions**

S1. **Service-side watcher logs bypass the controller logger** — `pkg/hostagent/events/events.go:68-126` (suggestion, regression)

```go
cfg := tail.Config{
	Follow:    true,
	ReOpen:    true,
	MustExist: false,
	Logger:    logrus.StandardLogger(),
}
...
if line.Err != nil {
	logrus.WithError(line.Err).Error("hostagent stdout tail error")
}
```

`runWatcher` builds a contextual controller logger with instance metadata at `pkg/controllers/lima/limavm/controllers/hostagent_watcher.go:72`, and the embedded service configures controller-runtime to use `klog` at `pkg/service/controllers/manager.go:97,105-106`. The new local `events.Watch` introduced here sends its diagnostics through the global `logrus` logger at `pkg/hostagent/events/events.go:73,103,113,125` instead. That drops the `instance`/`component` fields on the exact watcher path this PR is trying to make diagnosable, and it bypasses the service’s normal logging backend.

Fix:

```diff
-func Watch(ctx context.Context, haStdoutPath, haStderrPath string, begin time.Time, onEvent func(Event) bool) error {
+func Watch(ctx context.Context, logger logr.Logger, haStdoutPath, haStderrPath string, begin time.Time, onEvent func(Event) bool) error {
     cfg := tail.Config{
         Follow:    true,
         ReOpen:    true,
         MustExist: false,
-        Logger:    logrus.StandardLogger(),
+        Logger:    tail.DiscardingLogger,
     }
...
-    logrus.WithError(line.Err).Error("hostagent stdout tail error")
+    logger.Error(line.Err, "Hostagent stdout tail error")
```

If you still want the vendored tail library’s reopen messages, add a small adapter from `logr.Logger` to the `tail.Config.Logger` interface instead of using the global logger.

S2. **The vendored-license notice still points at the removed `nxadmtail` path** — `NOTICE:9-10` (suggestion, regression)

```text
- pkg/util/nxadmtail/: a vendored copy of github.com/nxadm/tail.
  See pkg/util/nxadmtail/LICENSE for details.
```

The stale text at `NOTICE:9-10` predates this PR, but the regression is introduced here: the vendored tree was moved to `pkg/util/tail/`, `pkg/util/nxadmtail/` no longer exists, and the new license file is at `pkg/util/tail/LICENSE:1-21`. Anyone auditing third-party license attributions from the source tree now gets a dead path.

Fix:

```diff
- - pkg/util/nxadmtail/: a vendored copy of github.com/nxadm/tail.
-   See pkg/util/nxadmtail/LICENSE for details.
+ - pkg/util/tail/: a vendored copy of github.com/nxadm/tail.
+   See pkg/util/tail/LICENSE for details.
```

### Design Observations

**Concerns**

- `(in-scope)` `pkg/hostagent/events.Watch` is now RDD-owned infrastructure on a controller hot path, but it still hard-wires a global logger. Passing the caller’s logger through this boundary would remove an unnecessary service/CLI split, preserve per-instance context, and make future watcher failures easier to trace without grep-heavy log forensics.

**Strengths**

- Keeping a single process-wide `fsnotify.Watcher` in `pkg/util/tail/watch/inotify_tracker.go:21-35` while splitting event/error draining from Add/Remove RPC handling is the right fix shape. It addresses the deadlock without giving back the Linux inotify-instance budget benefit of the shared design.
- The `sync.WaitGroup` handoff between `tail.waitForChanges()` and `tailFileSync()` in `pkg/util/tail/tail.go:268-271,388-395` is a precise fix for the lifecycle race. It closes the stale-watch window without broad refactoring.

### Testing Assessment

1. `pkg/hostagent/events.Watch` has no direct unit coverage for malformed JSON, `begin` filtering, or the “stop when `onEvent` returns true” contract in `pkg/hostagent/events/events.go:55-129`, even though RDD now owns this implementation instead of Lima.
2. There is no dedicated CI lane for the long stress regression test described in the PR text. Right now the default unit-test lane runs it accidentally (I1), rather than tracking it intentionally.
3. There is still no controller-level test that exercises the real `runWatcher` + local `pkg/hostagent/events` path across rotated `ha.stdout.log` / `ha.stderr.log` files; the new stress test proves `tail`, not the adapter that the LimaVM reconciler actually calls.

Validation run: `go test -short ./pkg/util/tail/... ./pkg/hostagent/events ./pkg/controllers/lima/...` passed. `go build ./...` in this checkout failed before reaching the reviewed code because `pkg/guestagent/lima-guestagent.gz` is missing.

### Documentation Assessment

- `pkg/hostagent/command.go:170` still says `pkg/hostagent/events.Watcher()` even though the exported symbol is now `Watch`.
- `pkg/util/tail/watch/watch.go:23-26` says `FileChanges` channels are closed after delete/truncate, but the implementations at `pkg/util/tail/watch/inotify.go:117-119,127-129` and `pkg/util/tail/watch/polling.go:93-104` only notify and return. The interface comment should describe the real contract before more callers depend on it.

### Acknowledged Limitations

- **Code comment**: `pkg/controllers/lima/limavm/controllers/hostagent_watcher.go:101-104` — “`ev.Status.Degraded` is intentionally not handled here...”. This PR is consistent with that rationale; the local watcher mirrors the same subset of Lima’s event schema, so the limitation is unchanged.
- **Code comment**: `pkg/hostagent/events/events.go:82-83` — “Do NOT call Cleanup; it unregisters the tracker entry in a way that prevents the process from ever tailing the file again.” This change reduces exposure by routing RDD through the fixed local fork, but it still documents a real inherited limitation.
- **Code comment**: `pkg/util/tail/watch/inotify.go:131-132` and `pkg/util/tail/watch/polling.go:97-98` — “XXX: report this error back to the user”. Vendoring did not worsen this, but the underlying error-propagation gap is still acknowledged in both watcher implementations.

### Coverage Summary

- `Makefile` — Reviewed, no issues
- `cmd/rdd/limavm.go` — Reviewed, no issues
- `cmd/rdd/service.go` — Reviewed, no issues
- `go.mod` — Reviewed, no issues
- `pkg/controllers/lima/limavm/controllers/hostagent_watcher.go` — S1
- `pkg/hostagent/events/events.go` — S1
- `pkg/util/nxadmtail/tail.go` — S2
- `pkg/util/tail/LICENSE` — S2
- `pkg/util/tail/stream.go` — Reviewed, no issues
- `pkg/util/tail/stream_test.go` — Reviewed, no issues
- `pkg/util/tail/tail.go` — Reviewed, no issues
- `pkg/util/tail/tail_posix.go` — Reviewed, no issues
- `pkg/util/tail/tail_stress_test.go` — I1
- `pkg/util/tail/tail_windows.go` — Reviewed, no issues
- `pkg/util/tail/watch/filechanges.go` — Reviewed, no issues
- `pkg/util/tail/watch/inotify.go` — Reviewed, no issues
- `pkg/util/tail/watch/inotify_tracker.go` — Reviewed, no issues
- `pkg/util/tail/watch/polling.go` — Reviewed, no issues
- `pkg/util/tail/watch/watch.go` — Reviewed, no issues
- `pkg/util/tail/winfile/winfile.go` — Reviewed, no issues

---

### [Gemini 3.1 Pro]

### Executive Summary

This PR successfully resolves a deadlock in the vendored `nxadm/tail` library by decoupling the `fsnotify.Events` and `fsnotify.Errors` drainers from the synchronous tracker RPC loop, and corrects a related lifecycle race during goroutine teardown. The core fix is robust and comprehensively tested. However, the newly introduced `events.Watch` utility improperly uses CLI-oriented loggers within a long-running controller context and exits permanently upon encountering invalid JSON. Additionally, the `Stream` wrapper ignores write errors and can leak goroutines on context cancellation.

### Findings

#### Critical Issues

None.

#### Important Issues

I1. **Stream ignores write errors and context cancellation when follow=false** — `pkg/util/tail/stream.go:41-43` (important, regression)

```go
	for line := range t.Lines {
		fmt.Fprintln(writer, line.Text)
	}
```
`Stream` ignores the error returned by `fmt.Fprintln`. If the output pipe is closed (e.g., a user runs `rdd svc logs | head`), the loop blindly consumes and discards the rest of the log file without exiting. 

Furthermore, `Stream` fails to honor `ctx` cancellation when `follow=false`. In that mode, `StopAtEOF` sets the tomb's error state to `errStopAtEOF`. When `ctx` is later cancelled, `Stream` calls `t.Stop()` which issues a `Kill(nil)` — but the tomb ignores this because `errStopAtEOF` is already locked in. Consequently, `tailFileSync` forces all remaining lines into the channel (see `tail.go:405-408`). If `Stream` stops reading the channel (or blocks on a closed pipe), `sendLine` deadlocks, permanently leaking the `Stream` goroutine and the tail.

**Fix:** Check the write error, issue a `t.Stop()`, and explicitly drain `t.Lines` on any early exit (including context cancellation) to unblock the internal `sendLine` logic.

```go
	ctxDone := ctx.Done()
	for {
		select {
		case <-ctxDone:
			_ = t.Stop()
			go func() {
				for range t.Lines {}
			}()
			return ctx.Err()
		case line, ok := <-t.Lines:
			if !ok {
				return nil
			}
			if _, err := fmt.Fprintln(writer, line.Text); err != nil {
				_ = t.Stop()
				go func() {
					for range t.Lines {}
				}()
				return err
			}
		}
	}
```

I2. **Watcher uses wrong logger for embedded service context** — `pkg/hostagent/events/events.go:19`, `pkg/hostagent/events/events.go:121` (important, regression)

```go
			if line.Err != nil {
				logrus.WithError(line.Err).Error("hostagent stdout tail error")
				continue
			}
```
The `pkg/hostagent/events` package explicitly imports and uses `github.com/sirupsen/logrus`, and configures the internal `tail` library to use `logrus.StandardLogger()`. Because this package is consumed by the `LimaVMReconciler` running within `rdd svc serve`, it violates the repository logging convention: controllers must use `klog` (via controller-runtime's `logr`) to write to `rdd.stderr.log`. Using `logrus` misroutes these logs to the terminal or whatever the global configuration defaults to.

**Fix:** Modify `Watch` to accept a `logr.Logger` (or extract it via `log.FromContext(ctx)`) and use a `logr`-backed adapter for the internal `tail.Config.Logger`.

I3. **Watcher dies permanently on invalid JSON line** — `pkg/hostagent/events/events.go:110-112` (important, gap)

```go
			if err := json.Unmarshal([]byte(line.Text), &ev); err != nil {
				return fmt.Errorf("failed to unmarshal %q as %T: %w", line.Text, ev, err)
			}
```
If the Lima hostagent emits a single non-JSON line to its stdout (e.g., a panic trace, an accidental library warning, or a corrupted log chunk), `json.Unmarshal` fails and `Watch` returns an error. The `runWatcher` goroutine in the `LimaVMReconciler` logs the error and exits without attempting to restart the watcher. This leaves the VM state mirror permanently wedged from the control plane's perspective.

**Fix:** Log the parsing failure as a warning and `continue` rather than aborting the entire event stream.

#### Suggestions

S1. **Polling file watcher blocks shutdown for up to POLL_DURATION** — `pkg/util/tail/watch/polling.go:87` (suggestion, gap)

```go
			time.Sleep(POLL_DURATION)
```
The polling watcher blocks its goroutine with an unresponsive `time.Sleep` instead of incorporating the delay into the `t.Dying()` select block. Because `tail.Stop()` now correctly synchronizes on the watcher goroutines via `tail.watchers.Wait()`, this sleep adds up to 250ms of blocking delay to the service shutdown or log rotation path.

**Fix:** Integrate the delay into the `t.Dying()` select to make the watcher instantly responsive to shutdown requests:

```go
			select {
			case <-t.Dying():
				return
			case <-time.After(POLL_DURATION):
			}
```

### Design Observations

**Concerns**

- **Process-wide events drainer stalls on slow tail consumers (in-scope)**: The shared `InotifyTracker` routes events from a single process-wide `t.watcher.Events` drainer into unbuffered per-file channels (`ch <- event`). If any individual tail's `ChangeEvents` goroutine is slow to read its channel (e.g., delayed by synchronous `os.Stat` calls during `Write` event handling), the central drainer blocks. This stalls event delivery for all other tails in the daemon and exacerbates the risk of `fsnotify` event buffer overflows. Consider buffering the per-file channels or using a non-blocking drop-and-notify approach to decouple the central tracker from consumer latency.

**Strengths**

- The `sync.WaitGroup` addition smoothly synchronizes `tail.Stop` and `ChangeEvents`, cleanly eliminating the `track`/`untrack` lifecycle race without introducing locking complexities into the watcher loops.

### Testing Assessment

1. The deadlock fix is comprehensively covered by `TestInotifyTrackerNoDeadlockOnRepeatedRotation`, which successfully reproduces the `readEvents` blocking race via high-volume external process writes.
2. The newly added `events.Watch` lacks unit tests for edge cases like malformed JSON recovery or stream termination behavior.

### Documentation Assessment

The exported `Stream` and `Open` functions have clear docstrings. However, `events.Watch` lacks documentation outlining its failure semantics (e.g., returning an error on invalid JSON), which is critical information given the reconciler's lack of auto-restart logic.

### Acknowledged Limitations

- **Code comment**: `pkg/hostagent/events/events.go:82-84`
  ```go
  // Do NOT call Cleanup; it unregisters the tracker entry in a way
  // that prevents the process from ever tailing the file again.
  ```
  The author explicitly notes that calling `Cleanup` breaks future tails on the same file. This is an accurate characterization of a pre-existing bug in the vendored `InotifyTracker` where `removeWatch` decrements `watchNums` unconditionally. If `untrack` is called redundantly (e.g., by `Cleanup` after `ChangeEvents` has already untracked via `Stop`), the counter drops to `-1`, causing subsequent `addWatch` calls to increment it to `0` without ever calling `watcher.Add()`.

### Coverage Summary

- `Makefile`: Trivial
- `cmd/rdd/limavm.go`: Trivial
- `cmd/rdd/service.go`: Trivial
- `go.mod`: Trivial
- `pkg/controllers/lima/limavm/controllers/hostagent_watcher.go`: Trivial
- `pkg/hostagent/events/events.go`: I2, I3
- `pkg/util/nxadmtail/tail.go`: Trivial
- `pkg/util/tail/LICENSE`: Trivial
- `pkg/util/tail/stream.go`: I1
- `pkg/util/tail/stream_test.go`: Trivial
- `pkg/util/tail/tail.go`: I1
- `pkg/util/tail/tail_posix.go`: Trivial
- `pkg/util/tail/tail_stress_test.go`: Reviewed, no issues
- `pkg/util/tail/tail_windows.go`: Trivial
- `pkg/util/tail/watch/filechanges.go`: Reviewed, no issues
- `pkg/util/tail/watch/inotify.go`: Reviewed, no issues
- `pkg/util/tail/watch/inotify_tracker.go`: Reviewed, no issues
- `pkg/util/tail/watch/polling.go`: S1
- `pkg/util/tail/watch/watch.go`: Reviewed, no issues
- `pkg/util/tail/winfile/winfile.go`: Reviewed, no issues

## Resolution

| | |
|---|---|
| **Addressed** | 2026-04-21 |
| **Commit** | `673e91e Address review 1 findings for PR #326 + prep tail for spin-out` |

| # | Finding | Action |
|---|---------|--------|
| 1 | Important #1: Stress test runs on the default unit-test lane on every CI run | Fixed |
| 2 | Suggestion #1: `events.Watch` uses the global `logrus` instead of the controller `logr` | Fixed |
| 3 | Suggestion #2: `events.Watch` aborts permanently on a single unparseable stdout line | Fixed |
| 4 | Suggestion #3: `Stream` swallows writer errors and ignores context on `follow=false` | Fixed |
| 5 | Suggestion #4: `NOTICE` still points at `pkg/util/nxadmtail/` | Fixed |
| 6 | Suggestion #5: `command.go:170` comment still references the removed `events.Watcher()` | Fixed |
| 7 | Suggestion #6: `FileWatcher.ChangeEvents` interface doc promises channel closes that the implementations do not perform | Fixed |
| 8 | Suggestion #7: Polling watcher sleeps 250 ms before noticing `tomb.Dying` | Fixed |
| 9 | Suggestion #8: `POLL_DURATION` ALL_CAPS / mutable-var rationale no longer applies | Fixed |
| 10 | Suggestion #9: Asymmetric "Do NOT call Cleanup" comment between the stdout and stderr defers | Fixed |
| 11 | Suggestion #10: Stress-test subprocess should use `t.Context()` instead of `context.Background()` | Fixed |
| 12 | Suggestion #11: `events.Watch` has no direct unit coverage | Test written |
| 13 | Testing Gap #2: Concurrent `Tail`s on the same filename | Skipped |
| 14 | Testing Gap #3: `fsnotify.NewWatcher()` error path panics | Skipped |
| 15 | Testing Gap #4: Controller-level `runWatcher` + `events.Watch` rotation test | Skipped |
