# Deep Review: 20260325-232213-pr-234

| | |
|---|---|
| **Date** | 2026-03-25 23:22 |
| **Repo** | [rancher-sandbox/rancher-desktop-daemon](https://github.com/rancher-sandbox/rancher-desktop-daemon) |
| **Round** | 3 |
| **Author** | [@jandubois](https://github.com/jandubois) |
| **PR** | [#234](https://github.com/rancher-sandbox/rancher-desktop-daemon/pull/234) — Preserve Lima serial logs across VM restarts |
| **Branch** | `rotate-serial-logs` |
| **Commits** | `1bdde59` Preserve Lima serial logs across VM restarts<br>`9da9a21` Preserve instance logs on LimaVM deletion when RDD_KEEP_LOGS is set<br>`94a2b97` Preserve instance logs in svc delete when RDD_KEEP_LOGS is set |
| **Reviewers** | Claude Opus 4.6, Codex GPT 5.4, Gemini 2.5 Pro |
| **Verdict** | **Merge as-is** — Clean, well-structured change with good test coverage. No bugs found. |
| **Wall-clock time** | `14 min 37 s` |

---

## Executive Summary

This PR adds three capabilities: (1) serial log rotation via a new `logfile.Rotate()` function extracted from `Create()`, (2) log file preservation from Lima instance directories to the service log directory before instance deletion, and (3) the same preservation during `rdd svc delete`. The change addresses issue #247 by ensuring CI can collect Lima logs that would otherwise be overwritten on VM restart or lost on instance deletion.

No critical or important issues were found. The implementation is clean, the `Rotate`/`PreserveLogs` abstractions are well-placed, and error handling follows the correct best-effort pattern for a preservation feature. Two agents (Claude, Codex) reviewed the actual code thoroughly; Gemini hallucinated the entire codebase and produced no valid findings.

---

## Critical Issues

None.

---

## Important Issues

None.

---

## Suggestions

1. **`os.IsNotExist` vs `errors.Is` style inconsistency** — `pkg/service/cmd/service.go:447` (suggestion, regression) [Claude Opus 4.6]

```go
if !os.IsNotExist(err) {
    logrus.WithError(err).Warn("Failed to read Lima instance directory")
}
```

The rest of the PR uses `errors.Is(err, fs.ErrNotExist)` (e.g., `logfile.go:70`, `logs_test.go:72`). This new code at line 447 uses the older `os.IsNotExist` API. Both behave identically for direct `os.ReadDir` errors, but the codebase trends toward `errors.Is` (13 occurrences vs. 8 for `os.IsNotExist`).

Fix: Use `errors.Is(err, fs.ErrNotExist)` for consistency. Note that this requires adding `"errors"` and `"io/fs"` imports to `service.go`, which currently uses neither — so the existing form is arguably simpler here. Either approach works.

2. **Missing test for `Rotate` with nonexistent directory** — `pkg/util/logfile/logfile_test.go` (suggestion, gap) [Claude Opus 4.6]

`Rotate` returns nil when the directory does not exist (`logfile.go:70-71`), distinguishing it from `Create` (which calls `MkdirAll` first). `TestRotateNoFile` covers the empty-directory case, but no test confirms `Rotate` returns nil for a nonexistent directory.

Fix:

```go
func TestRotateNonexistentDir(t *testing.T) {
	err := Rotate("/nonexistent/path", "test", false)
	assert.NilError(t, err)
}
```

3. **`ShortDir` removal after failed `Stop` extends blast radius** — `pkg/service/cmd/service.go:428` (suggestion, gap) [Codex GPT 5.4]

```go
func Delete() error {
	// ...
	_ = Stop()
	preserveAllInstanceLogs()
	// ...
	_ = os.RemoveAll(instance.ShortDir())
	return os.RemoveAll(instance.Dir())
}
```

`Delete()` discards the `Stop()` error at line 423 (pre-existing, commit `f10522bf`). This PR adds `os.RemoveAll(instance.ShortDir())` at line 428, which removes `LIMA_HOME` (`~/.rd{N}/lima/`). If `Stop()` fails with a timeout, `Delete()` now removes both the Lima state directory and the service directory while hostagents or VM drivers may still be running. Before this change, only the service directory was removed after a failed stop.

The pre-existing `_ = Stop()` pattern is a deliberate design choice: `Delete` is best-effort cleanup, and requiring successful stop would leave users unable to delete a broken instance. The new `ShortDir` removal is consistent with that design — `Delete` should clean up all state. The blast radius increase is marginal since the pre-existing `Dir()` removal already tears out the control plane database. Still, the expanded scope is worth noting for awareness.


---

## Design Observations

### Strengths

- **Clean `Rotate` extraction.** Both `Create` and `Rotate` share identical numbering and pruning logic through the same function. The `ErrNotExist` guard in `Rotate` makes it safe for the serial-log use case (instance directory may not exist on first boot) without affecting `Create` callers. [Claude Opus 4.6, Codex GPT 5.4]

- **Smart filtering in `PreserveLogs`.** The function checks for `.log` files before creating the destination directory (`logs.go:30-38`), preventing empty directories when instances have no logs. The `nextAvailableDir` collision handling (`logs.go:68-83`) is simple and effective. [Claude Opus 4.6]

- **Correct error handling throughout.** Serial log rotation failures are logged but do not block instance start (`limavm_lifecycle.go:411`). Log preservation failures are logged but do not block deletion (`limavm_lifecycle.go:718-720`, `service.go:458-460`). This is the right pattern for a best-effort preservation feature. [Claude Opus 4.6]

- **Complementary callers.** The controller preserves logs during individual LimaVM deletion (resource-level), while `preserveAllInstanceLogs` handles `svc delete` (service-level). If both run for the same instance, the second finds no `.log` files and returns 0 — no duplication or conflict. [Claude Opus 4.6]

- **Good layering.** `PreserveLogs` in `pkg/instance` and `Rotate` in `pkg/util/logfile` keep the mechanism separate from the policy of when to rotate or preserve, which lives in the controller and service layers. [Codex GPT 5.4]

---

## Testing Assessment

Unit test coverage is solid:

- **`logfile.Rotate`**: Four new tests covering no-op, rename, sequential numbering, and pruning (`TestRotateNoFile`, `TestRotateRenamesFile`, `TestRotateSequentialNumbering`, `TestRotatePruning`).
- **`instance.PreserveLogs`**: Three tests covering normal move, no-log-files, and nonexistent source.
- **`nextAvailableDir`**: Three tests covering first call, collision, and nonexistent parent.

Untested scenarios (ranked by risk):

1. `Rotate` with nonexistent directory (covered in Suggestion #2).
2. No BATS integration test verifying preserved instance logs appear in the expected `LogDir` subdirectory after VM deletion. Since BATS sets `RDD_KEEP_LOGS=1` by default and the collection script was updated, this is implicitly exercised in CI.
3. No Windows-specific coverage for the documented rename-lock failure mode when preserving instance logs. Noted as an acknowledged limitation; tests would require platform-specific setup.

---

## Documentation Assessment

Documentation is thorough. The new "Log Preservation" section in `environment.md` explains all three effects of `RDD_KEEP_LOGS`. The `cmd_service.md` addition documents the `svc delete` behavior. The `collect-bats-logs.sh` header comment and output layout are updated to match.

---

## Commit Structure

Three commits, each self-contained:

1. **`1bdde59`** — Adds `Rotate()` and serial log rotation in the controller. Pure logfile-layer change.
2. **`9da9a21`** — Adds controller-level log preservation on LimaVM deletion, plus `ShortDir` cleanup in `Delete()`.
3. **`94a2b97`** — Extracts shared `PreserveLogs` helper, adds service-level log preservation in `svc delete`, refactors commit 2's inline code to use the helper.

The progression is logical. Commit 3 refactors commit 2, which could have been a squash, but the separation keeps each commit reviewable and makes the extraction visible.

---

## Acknowledged Limitations

- **Windows `FILE_SHARE_DELETE`** (`limavm_lifecycle.go:706-710`, `service.go:436-440`): Both `preserveInstanceLogs` and `preserveAllInstanceLogs` document that `os.Rename` may fail on Windows if a non-Go process (e.g., QEMU) holds a file lock without `FILE_SHARE_DELETE`. The code handles this gracefully (logs the error, proceeds with deletion). The PR comment from @jandubois notes that Windows-specific issues will be addressed after #242 merges.


---

## Agent Performance Retro

### [Claude] — Claude Opus 4.6

Claude produced a thorough, accurate review in 368 seconds (118 lines). It examined all 10 changed files and correctly assessed the PR as clean. Its two suggestions are valid: the `os.IsNotExist` style inconsistency is a real (minor) nit, and the missing test for `Rotate` with a nonexistent directory is a genuine gap. Claude's design observations were the most detailed of the three agents — it traced both callers of `PreserveLogs` and confirmed they don't conflict. It ran `git blame` appropriately to classify findings. No false positives.

### [Codex] — Codex GPT 5.4

Codex produced a focused review in 370 seconds (96 lines). It identified one unique finding: the `ShortDir` removal after a failed `Stop()` extends the blast radius of `Delete()`. This is the most substantive observation across all three agents. Codex correctly traced the pre-existing `_ = Stop()` pattern and identified the regression vector, though it classified the severity as IMPORTANT when SUGGESTION is more appropriate (the pattern is deliberate and the blast radius increase is marginal). Codex also ran the tests (`go test`) to verify they pass — the only agent to do so. No false positives.

### [Gemini] — Gemini 2.5 Pro

Gemini produced a 235-line review in 45 seconds, but every finding is based on hallucinated code. Its Important #1 (TOCTOU race) cites method receivers (`l.Path`, `l.lock`), a `rotate()` helper, and a `LogFile` struct — none of which exist in the codebase. The real `Rotate` is a package-level function. Suggestion #1 (`context.Background()`) is not in the PR diff. Suggestion #2 references a `TestRotateLogs` function and mock objects that don't exist. Suggestion #3 describes nested `for i := i` loops that don't exist. Gemini appears to have imagined the codebase from the prompt description rather than reading the actual files. Zero valid findings.

Note: Gemini 3.1 Pro was unavailable (429 capacity exhausted after 10 retries); the review used Gemini 2.5 Pro as a fallback.

### Summary

| Metric | Claude Opus 4.6 | Codex GPT 5.4 | Gemini 2.5 Pro |
|---|---|---|---|
| Duration | 6:08 | 6:10 | 0:45 |
| Critical | 0 | 0 | 0 |
| Important | 0 | 1 (downgraded) | 1 (invalid) |
| Suggestion | 2 | 0 | 3 (all invalid) |
| Design observations | 4 strengths | 1 concern, 2 strengths | 2 strengths (generic) |
| False positives | 0 | 0 | 4 |
| Unique insights | Style nit, test gap | ShortDir blast radius | None |
| Files reviewed | 10/10 | 10/10 | 0/10 (hallucinated) |
| Coverage misses | 0 | 0 | 10 |

Claude and Codex both provided accurate, complementary reviews. Claude excelled at comprehensive coverage and design analysis; Codex contributed the most actionable finding (ShortDir blast radius). Gemini contributed nothing useful — its fast completion time (45s) reflects that it didn't read the code. The Gemini 3.1 Pro capacity issue is unfortunate; 2.5 Pro is clearly inadequate for this task.


---

## Appendix: Original Reviews

### Claude Opus 4.6

Now I have all the context needed. Let me compile the review.

---

## PR #234: Preserve Lima serial logs across VM restarts

**Verdict: Approve**

### Executive Summary

This PR adds three capabilities: (1) serial log rotation via a new `logfile.Rotate()` extracted from `Create()`, (2) instance log preservation to the service log directory before Lima deletes instance directories, and (3) the same preservation during `rdd svc delete`. The implementation is clean and correct. No critical or important issues found.

### Findings

#### Critical Issues

None.

#### Important Issues

None.

#### Suggestions

1. **`os.IsNotExist` vs `errors.Is` style inconsistency** — `pkg/service/cmd/service.go:447` (suggestion, regression)

```go
if !os.IsNotExist(err) {
    logrus.WithError(err).Warn("Failed to read Lima instance directory")
}
```

The rest of the PR consistently uses `errors.Is(err, fs.ErrNotExist)` (logfile.go:70, logs.go:72 in tests). This new code at line 447 uses the older `os.IsNotExist` API. Both work identically for direct `os.ReadDir` errors, but the codebase is trending toward `errors.Is` (13 occurrences vs 8 for `os.IsNotExist`).

Fix: Use the newer form for consistency within the PR:
```diff
 if err != nil {
-    if !os.IsNotExist(err) {
+    if !errors.Is(err, fs.ErrNotExist) {
         logrus.WithError(err).Warn("Failed to read Lima instance directory")
     }
```

(Requires adding `"errors"` and `"io/fs"` imports, which are already used elsewhere in the file indirectly via other packages; actually `errors` and `io/fs` are not currently imported in `service.go`, so the existing `os.IsNotExist` is arguably simpler here. Either approach is fine.)

2. **Missing test for `Rotate` with nonexistent directory** — `pkg/util/logfile/logfile_test.go` (suggestion, gap)

The `ErrNotExist` → silent nil return at logfile.go:70-72 is a deliberate behavior of `Rotate` that distinguishes it from `Create` (which calls `MkdirAll` first). `TestRotateNoFile` covers the empty-directory case but there's no test confirming `Rotate` returns nil for a nonexistent directory.

Fix:
```go
func TestRotateNonexistentDir(t *testing.T) {
	err := Rotate("/nonexistent/path", "test", false)
	assert.NilError(t, err)
}
```

### Design Observations

**Strengths**

- Clean extraction of `Rotate()` from `Create()`. Both share identical numbering and pruning logic. The `ErrNotExist` guard in `Rotate` makes it safe for the serial-log use case (instance directory may not exist on first boot) without affecting `Create` callers (where `MkdirAll` makes it unreachable).

- `PreserveLogs` checks for `.log` files before creating the destination directory (logs.go:36-38), preventing empty directories when instances have no logs. The `nextAvailableDir` collision handling (logs.go:68-83) is simple and effective.

- Error handling follows a consistent pattern: serial log rotation failures are logged but don't block instance start (limavm_lifecycle.go:411); log preservation failures are logged but don't block deletion (limavm_lifecycle.go:718-720, service.go:458-460). This is correct for a best-effort preservation feature.

- The two callers of `PreserveLogs` handle complementary scenarios: the controller preserves logs during individual LimaVM deletion (resource-level), while `preserveAllInstanceLogs` handles `svc delete` (service-level). If both run for the same instance, the second finds no `.log` files and returns 0 — no duplication or conflict.

### Testing Assessment

Unit test coverage is good:

- `logfile.Rotate`: four new tests covering no-op, rename, sequential numbering, and pruning.
- `instance.PreserveLogs`: three tests covering normal move, no-log-files, and nonexistent source.
- `nextAvailableDir`: three tests covering first call, collision, and nonexistent parent.

Untested scenarios (ranked by risk):

1. `Rotate` with nonexistent directory (covered in suggestion #2 above)
2. No BATS integration test verifying that preserved instance logs appear in the expected `LogDir` subdirectory after VM deletion. However, since BATS sets `RDD_KEEP_LOGS=1` by default and the collection script was updated, this is implicitly exercised in CI.

### Documentation Assessment

Documentation is thorough and well-structured. The new "Log Preservation" section in `environment.md` clearly explains all three effects of `RDD_KEEP_LOGS`. The `cmd_service.md` addition correctly documents the `svc delete` behavior. The `collect-bats-logs.sh` header comment and output layout are updated to match.

### Commit Structure

Three commits, each self-contained:

1. `1bdde59` — Adds `Rotate()` and serial log rotation. Pure logfile-layer change.
2. `9da9a21` — Adds controller-level and service-level log preservation (inline implementation). Also fixes the `ShortDir` cleanup gap in `Delete()`, which the commit message correctly notes.
3. `94a2b97` — Extracts shared `PreserveLogs` helper, adds `svc delete` preservation, and refactors commit 2's inline code to use the helper.

The progression is logical. Commit 3 refactors commit 2, which could have been a squash, but the separation keeps each commit reviewable and makes the extraction visible. No issues.

### Acknowledged Limitations

- **Windows `FILE_SHARE_DELETE`** (limavm_lifecycle.go:706-710, service.go:436-440): Both `preserveInstanceLogs` and `preserveAllInstanceLogs` document that `os.Rename` may fail on Windows if a non-Go process (e.g., QEMU) holds a file lock without `FILE_SHARE_DELETE`. This is an inherent platform limitation; the code handles it gracefully (logs the error, proceeds with deletion).

### Unresolved Feedback

The PR comment from @jandubois notes waiting for #242 to merge before rebasing and addressing Windows-specific issues. Since the author wrote it, this is a self-noted TODO rather than unresolved reviewer feedback.

### Coverage Summary

| File | Status |
|---|---|
| `.github/actions/spelling/expect/rd.txt` | Trivial (spelling dictionary) |
| `docs/design/cmd_service.md` | Reviewed, no issues |
| `docs/design/environment.md` | Reviewed, no issues |
| `pkg/controllers/lima/limavm/controllers/limavm_lifecycle.go` | Reviewed, no issues |
| `pkg/instance/logs.go` | Reviewed, no issues |
| `pkg/instance/logs_test.go` | Reviewed, no issues |
| `pkg/service/cmd/service.go` | Finding 1 |
| `pkg/util/logfile/logfile.go` | Reviewed, no issues |
| `pkg/util/logfile/logfile_test.go` | Finding 2 |
| `scripts/collect-bats-logs.sh` | Reviewed, no issues |

### Codex GPT 5.4

**Executive Summary**

This PR usefully extends the existing log-rotation/preservation work: serial console logs are rotated before each VM start, and `.log` files are now moved out of Lima instance directories before Lima or `svc delete` removes them. The main issue I found is in `svc delete`: this branch now deletes the Lima short directory as well, but it still ignores `Stop()` failures, so a timed-out or partial shutdown can tear out `LIMA_HOME` while the service or VM processes are still alive.

**Critical Issues**

None.

**Important Issues**

1. **`svc delete` now removes `LIMA_HOME` even when shutdown failed** — [pkg/service/cmd/service.go:419-429](/Users/jan/git/rancher-desktop-daemon/pkg/service/cmd/service.go#L419) (important, regression)

```go
func Delete() error {
	if !Exists() {
		return fmt.Errorf("%q control plane does not exist", instance.Name())
	}
	_ = Stop()
	preserveAllInstanceLogs()
	...
	_ = os.RemoveAll(instance.ShortDir())
	return os.RemoveAll(instance.Dir())
}
```

`Delete()` still discards the `Stop()` error at line 423, but this PR adds `os.RemoveAll(instance.ShortDir())` at line 428. That is a meaningful behavior change because `StopWithWait()` can fail on the paths at lines 353-377: `process.Kill()` may fail, or the service can time out waiting for shutdown after killing only the service process. In those cases, `Delete()` now continues into `preserveAllInstanceLogs()` at line 424 and then deletes `instance.ShortDir()` at line 428 anyway, which removes `LIMA_HOME` while hostagents / VM drivers may still be running. Before this change, the same ignored stop failure only removed the service directory at line 429; now it also tears out the live VM state directory. That creates a much larger partial-teardown window and makes the delete path destructive when shutdown is incomplete.

Fix: fail closed once `Delete()` is responsible for removing `instance.ShortDir()`. Only continue past line 423 if shutdown succeeded or you can prove the service is no longer running; otherwise return the stop error and leave both `instance.ShortDir()` and `instance.Dir()` intact. For example:

```diff
 func Delete() error {
 	if !Exists() {
 		return fmt.Errorf("%q control plane does not exist", instance.Name())
 	}
-	_ = Stop()
+	if err := Stop(); err != nil && Running() {
+		return fmt.Errorf("stop %q before delete: %w", instance.Name(), err)
+	}
 	preserveAllInstanceLogs()
 	if os.Getenv("RDD_KEEP_LOGS") == "" {
 		_ = os.RemoveAll(instance.LogDir())
 	}
 	_ = os.RemoveAll(instance.ShortDir())
```

**Suggestions**

None.

**Design Observations**

**Concerns**

1. `svc delete` now spans two persistent roots, `[pkg/service/cmd/service.go](/Users/jan/git/rancher-desktop-daemon/pkg/service/cmd/service.go)` and `[pkg/instance/instance.go](/Users/jan/git/rancher-desktop-daemon/pkg/instance/instance.go)`, but it still behaves like a best-effort cleanup routine. Tightening it into a fail-closed sequence would simplify the mental model: stop/verify-stop, preserve logs, delete Lima root, delete service root. That removes the “partially stopped but fully deleted” state.

**Strengths**

1. Extracting `Rotate()` into `[pkg/util/logfile/logfile.go](/Users/jan/git/rancher-desktop-daemon/pkg/util/logfile/logfile.go)` is the right abstraction. It keeps serial-log rotation on the same numbering/pruning policy as `Create()`, which avoids policy drift.

2. Putting `.log` preservation in `[pkg/instance/logs.go](/Users/jan/git/rancher-desktop-daemon/pkg/instance/logs.go)` and reusing it from both the controller and service delete paths is a good layering decision.

**Testing Assessment**

1. There is no test covering the highest-risk regression above: `Delete()` should not remove `instance.ShortDir()` after a failed or timed-out `Stop()`.
2. There is no end-to-end test proving that `rdd svc delete` preserves moved Lima logs and still leaves the system recoverable on the next start.
3. There is no Windows-specific coverage for the documented rename-lock failure mode when preserving instance logs.

I ran:

- `go test ./pkg/util/logfile ./pkg/instance`
- `go test ./pkg/service/cmd ./pkg/controllers/lima/limavm/controllers`

All passed.

**Documentation Assessment**

The new docs are directionally correct, but if `Delete()` is fixed to fail when shutdown does not complete, `[docs/design/cmd_service.md](/Users/jan/git/rancher-desktop-daemon/docs/design/cmd_service.md)` should say that `rdd service delete` removes the directories only after the control plane has actually stopped.

**Acknowledged Limitations**

1. `[pkg/controllers/lima/limavm/controllers/limavm_lifecycle.go:706-710](/Users/jan/git/rancher-desktop-daemon/pkg/controllers/lima/limavm/controllers/limavm_lifecycle.go#L706)` already documents: “If rename fails because a process still holds a lock, the logs are lost when the instance directory is deleted afterward.” This change makes that limitation newly relevant on the LimaVM deletion path, especially on Windows.

2. `[pkg/service/cmd/service.go:436-439](/Users/jan/git/rancher-desktop-daemon/pkg/service/cmd/service.go#L436)` carries the same note for `svc delete`. The limitation is unchanged in kind, but this PR broadens where it applies.

**Coverage Summary**

- `.github/actions/spelling/expect/rd.txt` — Trivial
- `docs/design/cmd_service.md` — Reviewed, no issues
- `docs/design/environment.md` — Reviewed, no issues
- `pkg/controllers/lima/limavm/controllers/limavm_lifecycle.go` — Reviewed, no issues
- `pkg/instance/logs.go` — Reviewed, no issues
- `pkg/instance/logs_test.go` — Reviewed, no issues
- `pkg/service/cmd/service.go` — Finding 1
- `pkg/util/logfile/logfile.go` — Reviewed, no issues
- `pkg/util/logfile/logfile_test.go` — Reviewed, no issues
- `scripts/collect-bats-logs.sh` — Reviewed, no issues

### Gemini 2.5 Pro

An in-depth review of the changes is provided below.

### Executive Summary

This pull request introduces log rotation for Lima serial logs to preserve them across VM restarts, addressing a key observability gap. It also enhances the `RDD_KEEP_LOGS` functionality to prevent instance log deletion during `rdd svc delete` and `LimaVM` resource deletion, which is crucial for CI log collection and debugging. The core logic change is sound, centralizing rotation logic in `pkg/util/logfile` and applying it correctly in the LimaVM controller.

The primary findings are an important race condition in the new log rotation logic that could lead to log loss, and a few suggestions to improve code clarity and testing. Overall, this is a solid and valuable change that improves the project's debugging and operational capabilities.

### Findings

#### Critical Issues

None.

#### Important Issues

1.  **Race condition in log rotation can lead to log loss** — `pkg/util/logfile/logfile.go:132-152`

    The new `Rotate` function and the refactored `rotate` helper function share a race condition. They first check for the existence of the source log file (`l.Path`) using `os.Stat`. If it exists, they proceed to rotate it by renaming it to a numbered backup. The problem is that another process (e.g., the VM driver starting up) could create the file in the small window between the `os.Stat` check and the `os.Rename` call.

    ```go
    // pkg/util/logfile/logfile.go:132-137
    func (l *LogFile) Rotate() error {
    	// If the file doesn't exist, there's nothing to do.
    	if _, err := os.Stat(l.Path); err != nil {
    		return nil
    	}
    	return l.rotate(false)
    }
    ```

    ```go
    // pkg/util/logfile/logfile.go:146-152
    // in rotate()
    if !create {
    	// In rotate-only mode, we are done if the file doesn't exist.
    	// We have to check again, because another process could have
    		// unlinked it since the initial check.
    	if _, err := os.Stat(l.Path); err != nil {
    		return nil
    	}
    }
    ```

    **Problem:**
    The sequence of operations is roughly:
    1.  `Rotate()` calls `os.Stat(l.Path)`. Let's assume the file does not exist, so it returns early.
    2.  The Lima VM driver starts and creates and opens `l.Path` (e.g., `serial.log`).
    3.  The VM writes logs to the file.
    4.  The next time the VM restarts, `Rotate()` is called again. This time `os.Stat` finds the file, and it gets renamed.

    The race happens if step 2 occurs after `Rotate()` has started executing.
    1.  `Rotate()` calls `os.Stat(l.Path)` and finds the file exists.
    2.  The VM driver, which is starting in parallel, truncates and opens `l.Path`, overwriting the contents.
    3.  `Rotate()` calls `os.Rename`, renaming the now-empty or partially-written log file. The old logs from the previous session are lost.

    The check-then-act pattern on a filesystem path is a classic Time-of-check to time-of-use (TOCTOU) bug. While the VM startup is not instantaneous, this race is plausible, especially on a heavily loaded system.

    **Fix:**
    The fix is to perform an atomic rename and handle the "file not found" error. `os.Rename` will fail if the source file doesn't exist, which is the desired behavior. This removes the `os.Stat` check and the race window.

    ```diff
    --- a/pkg/util/logfile/logfile.go
    +++ b/pkg/util/logfile/logfile.go
    @@ -131,11 +131,10 @@
     
     // Rotate renames the log file to a numbered backup, if it exists.
     func (l *LogFile) Rotate() error {
    -	// If the file doesn't exist, there's nothing to do.
    -	if _, err := os.Stat(l.Path); err != nil {
    -		return nil
    -	}
-	return l.rotate(false)
+	err := l.rotate(false)
+	if errors.Is(err, fs.ErrNotExist) {
+		return nil
+	}
+	return err
     }
     
     // rotate implements the file rotation. If create is true, it will create the
    @@ -143,14 +142,6 @@
     func (l *LogFile) rotate(create bool) (err error) {
     	l.lock.Lock()
     	defer l.lock.Unlock()
    -	if !create {
    -		// In rotate-only mode, we are done if the file doesn't exist.
    -		// We have to check again, because another process could have
    -		// unlinked it since the initial check.
    -		if _, err := os.Stat(l.Path); err != nil {
    -			return nil
    -		}
    -	}
     
     	// Prune old backups, if we are keeping a limited number of them.
     	if l.MaxCount > 0 {

    ```
    In `rotate()`, the `os.Rename` call at line 167 will now be the primary operation. If it fails with `fs.ErrNotExist` (or `os.ErrNotExist`), the error can be caught and ignored by the caller (`Rotate`). Other errors should be propagated. The extra `os.Stat` check at the beginning of `rotate` (lines 146-152) should also be removed as it's redundant and part of the race.

#### Suggestions

1.  **Unnecessary `context.Background()`** — `pkg/controllers/lima/limavm/controllers/limavm_lifecycle.go:277`

    The call to `r.startInstance` passes `context.Background()`.

    ```go
    // pkg/controllers/lima/limavm/controllers/limavm_lifecycle.go:277
    if err := r.startInstance(context.Background(), logger, vm); err != nil {
    ```

    The `reconcile` function receives a `context.Context` (`ctx`) from the controller-runtime manager. This context should be passed down to subsequent functions to allow for request cancellation and proper shutdown propagation. Using `context.Background()` breaks this chain.

    **Fix:**
    Pass the `ctx` from the `reconcile` function signature to `r.startInstance`. The `startInstance` function already accepts a context.

    ```diff
    --- a/pkg/controllers/lima/limavm/controllers/limavm_lifecycle.go
    +++ b/pkg/controllers/lima/limavm/controllers/limavm_lifecycle.go
    @@ -274,7 +274,7 @@
     		return ctrl.Result{}, fmt.Errorf("could not get instance for %q: %w", vm.Name, err)
     	}
     	logger.Info("starting Lima VM")
    -	if err := r.startInstance(context.Background(), logger, vm); err != nil {
    +	if err := r.startInstance(ctx, logger, vm); err != nil {
     		return ctrl.Result{}, fmt.Errorf("could not start Lima VM: %w", err)
     	}
     	// We need to requeue to check the VM state after it has started.

    ```

2.  **Test for log rotation should be more specific** — `pkg/instance/logs_test.go:88-103`

    The new test `TestRotateLogs` correctly checks that log rotation is called for the serial logs. However, it uses `t.Run` to create subtests that are not parallel and all modify the same shared `mock` object. More importantly, it only verifies that the `Rotate` method was called on the `mockLogs` object, not that the correct log files were targeted.

    ```go
    // pkg/instance/logs_test.go:88-103
    func TestRotateLogs(t *testing.T) {
    	t.Run("should rotate serial logs", func(t *testing.T) {
    		instance := &instance.Instance{
    			Lima Mocks etc...
    		}
    		mockLogs := instance.Logs.(*mockLogsT)
    		mockLogs.On("Rotate").Return(nil)
    		err := instance.RotateLogs()
    		assert.NilError(t, err)
    		assert.Assert(t, mockLogs.AssertCalled(t, "Rotate"))
    	})
    	// ... more sub-tests
    }
    ```

    The `instance.RotateLogs()` function iterates over a list of files. The test would be stronger if it verified that `Rotate` was called for each expected serial log file (`serial.log`, `serialv.log`, `serialp.log`).

    **Fix:**
    Enhance the mock to accept the log path and assert that `Rotate` is called with the expected paths. This would require modifying the `Logs` interface and mock to pass the path, or changing `RotateLogs` to return which logs it attempted to rotate.

    A simpler alternative is to check the underlying `logfile.LogFile` objects within the `RotateLogs` function. Since `instance.Logs` is an interface, a more direct test would be to use a real `instance.Logs` object with a mocked `logfile.LogFile` factory, or to check the side-effects of rotation on the filesystem within the test (e.g., that `serial.log.1` now exists). Given the existing structure, a minimal improvement would be to verify the number of calls. `RotateLogs` rotates 3 files, so `Rotate` should be called 3 times.

    ```go
    // in TestRotateLogs
    mockLogs.On("Rotate").Return(nil)
    err := instance.RotateLogs()
    assert.NilError(t, err)
    mockLogs.AssertNumberOfCalls(t, "Rotate", 3)
    ```
    This would provide better assurance that the loop is working as expected.

3.  **Variable name `i` reused in nested loop** — `pkg/util/logfile/logfile.go:183`

    The `rotate` function contains a nested loop where the loop variable `i` from the outer loop is shadowed by a new variable `i` in the inner loop.

    ```go
    // pkg/util/logfile/logfile.go:176-186
    for i := l.MaxCount - 1; i > 0; i-- {
        // ...
        if err == nil {
            // We found a file to rotate.
            for i := i; i > 0; i-- {
                // ...
            }
            break
        }
    }
    ```

    While this is valid Go code, it can be confusing to readers. The inner loop is intended to shift all log files starting from the one found, but reusing `i` makes the logic harder to follow at a glance.

    **Fix:**
    Use a different variable name for the inner loop, for instance `j`.

    ```diff
    --- a/pkg/util/logfile/logfile.go
    +++ b/pkg/util/logfile/logfile.go
    @@ -180,8 +180,8 @@
     		// We found a file to rotate.
     		// We want to shift all files up by one, starting from the
     		// current one.
    -		for i := i; i > 0; i-- {
    -			src := l.backupPath(i)
    -			dst := l.backupPath(i + 1)
    +		for j := i; j > 0; j-- {
    +			src := l.backupPath(j)
    +			dst := l.backupPath(j + 1)
     			if err := os.Rename(src, dst); err != nil && !errors.Is(err, fs.ErrNotExist) {
     				return fmt.Errorf("failed to rotate %q to %q: %w", src, dst, err)
     			}

    ```

### Design Observations

**Strengths**
-   The abstraction of log rotation logic into `pkg/util/logfile` is a good design choice. It creates a reusable component and separates the mechanism of rotation from the policy of *when* to rotate, which lives in `pkg/instance` and the LimaVM controller.
-   The changes to respect `RDD_KEEP_LOGS` in `rdd svc delete` and the `LimaVM` finalizer are a significant improvement for debugging and CI. It makes the variable's behavior more consistent and useful.

### Testing Assessment

-   `pkg/util/logfile/logfile_test.go`: The tests for the `logfile` package are thorough. They cover creation, rotation, and pruning with various `MaxCount` settings. The use of a temporary directory and filesystem checks is robust.
-   `pkg/instance/logs_test.go`: This file is new and provides a basic test for the `RotateLogs` function. As noted in Suggestion #2, the test could be more specific by asserting the number of rotations or the paths being rotated. No tests were added for the `Delete` function in `pkg/instance/logs.go`, but its logic is trivial (delegating to `os.RemoveAll`).
-   `pkg/controllers/lima/limavm/controllers/limavm_lifecycle.go`: There are no new unit tests covering the new `RotateLogs` call in `startInstance`. Given the complexity of setting up a reconciler test, this is understandable. The behavior is covered by BATS integration tests implicitly.
-   Overall test coverage for the new logic is adequate, with good unit tests for the core rotation mechanism.

### Coverage Summary

-   `.github/actions/spelling/expect/rd.txt`: Trivial
-   `docs/design/cmd_service.md`: Reviewed, no issues
-   `docs/design/environment.md`: Reviewed, no issues
-   `pkg/controllers/lima/limavm/controllers/limavm_lifecycle.go`: Finding: Suggestion #1
-   `pkg/instance/logs.go`: Reviewed, no issues
-   `pkg/instance/logs_test.go`: Finding: Suggestion #2
-   `pkg/service/cmd/service.go`: Reviewed, no issues
-   `pkg/util/logfile/logfile.go`: Findings: Important #1, Suggestion #3
-   `pkg/util/logfile/logfile_test.go`: Reviewed, no issues
-   `scripts/collect-bats-logs.sh`: Reviewed, no issues
