# Deep Review: 20260325-180207-pr-234

| | |
|---|---|
| **Date** | 2026-03-25 18:02 |
| **Repo** | [rancher-sandbox/rancher-desktop-daemon](https://github.com/rancher-sandbox/rancher-desktop-daemon) |
| **Round** | 2 |
| **Author** | [@jandubois](https://github.com/jandubois) |
| **PR** | [#234](https://github.com/rancher-sandbox/rancher-desktop-daemon/pull/234) — Preserve Lima serial logs across VM restarts |
| **Branch** | `rotate-serial-logs` |
| **Commits** | `2ce15bc` Preserve Lima serial logs across VM restarts<br>`df1d963` Preserve instance logs on LimaVM deletion when RDD_KEEP_LOGS is set<br>`480661b` Preserve instance logs in svc delete when RDD_KEEP_LOGS is set |
| **Reviewers** | Claude Opus 4.6, Codex GPT 5.4, Gemini 3.1 Pro |
| **Verdict** | **Merge with fixes** — Windows file-lock race undermines log preservation in CI |
| **Wall-clock time** | `14 min 40 s` |

---

## Consolidated Review

### Executive Summary

This PR adds three log preservation mechanisms: serial log rotation before each VM start, instance log preservation during LimaVM deletion, and the same during `rdd svc delete` — all gated on `RDD_KEEP_LOGS`. The implementation extracts a clean `Rotate` helper from `Create` and introduces a shared `PreserveLogs` function with atomic directory naming. The main concern is that both preservation paths treat failures as best-effort warnings, then proceed to delete the source directories — on Windows, where killed processes retain file locks briefly, this race will cause the preservation to fail and the logs to be deleted, defeating the PR's purpose for Windows CI.

**Structure:** 0 critical, 1 important, 4 suggestions.

---

### Critical Issues

None.

---

### Important Issues

1. **Log preservation failures do not prevent source deletion** — `pkg/controllers/lima/limavm/controllers/limavm_lifecycle.go:705-719`, `pkg/service/cmd/service.go:435-459` (important, gap) [Codex GPT 5.4, Gemini 3.1 Pro]

```go
// Controller path (limavm_lifecycle.go:711-714)
count, err := instance.PreserveLogs(inst.Dir, inst.Name)
if err != nil {
    logger.Error(err, "Failed to preserve instance logs")
    return
}
```

Both preservation paths — `preserveInstanceLogs` in the controller (line 705) and `preserveAllInstanceLogs` in the service (line 435) — log failures as warnings, then allow the caller to delete the source directory. In the controller, `preserveInstanceLogs` returns void at line 714, and `handleDeletion` calls `limainstance.Delete` at line 82 regardless. In the service, `preserveAllInstanceLogs` continues to the next instance at line 454, and `Delete()` calls `os.RemoveAll(instance.ShortDir())` at line 428.

On Windows, `stopInstanceForcibly` (line 55) and `Stop()` (line 423) terminate processes, but Windows does not release file locks until the process handle is fully closed. The subsequent `os.Rename` calls in `PreserveLogs` (line 54 of `logs.go`) fail with `ERROR_SHARING_VIOLATION`, the error is logged, and the source logs are then deleted — the exact outcome this PR aims to prevent.

On Unix this race is unlikely because killed processes release file descriptors immediately, and `os.Rename` does not require exclusive access. The practical impact is limited to Windows CI, which is the PR author's primary concern (issue #247).

Fix: In `preserveInstanceLogs`, return an error and let `handleDeletion` decide whether to retry or proceed. In `Delete()`, add a brief `time.Sleep` or poll for file accessibility after `Stop()` on Windows before calling `preserveAllInstanceLogs`. Alternatively, accept the best-effort design but document the Windows limitation with a TODO.

```diff
-func preserveInstanceLogs(ctx context.Context, inst *limatype.Instance) {
+func preserveInstanceLogs(ctx context.Context, inst *limatype.Instance) error {
     if os.Getenv("RDD_KEEP_LOGS") == "" {
-        return
+        return nil
     }
     logger := log.FromContext(ctx)
     count, err := instance.PreserveLogs(inst.Dir, inst.Name)
     if err != nil {
-        logger.Error(err, "Failed to preserve instance logs")
-        return
+        return fmt.Errorf("preserve instance logs: %w", err)
     }
     if count > 0 {
         logger.Info("Preserved instance logs", "instance", inst.Name, "count", count)
     }
+    return nil
 }
```

---

### Suggestions

1. **`Rotate` could handle nonexistent directory gracefully** — `pkg/util/logfile/logfile.go:66-69` (suggestion, enhancement) [Gemini 3.1 Pro]

```go
entries, err := os.ReadDir(dir)
if err != nil {
    return fmt.Errorf("read log directory %s: %w", dir, err)
}
```

`Rotate` propagates `os.ReadDir` errors, including `fs.ErrNotExist`. In `startInstance` (line 410 of `limavm_lifecycle.go`), `inst.Dir` always exists because `store.Inspect` has already succeeded, so this does not trigger in practice. However, as a public API, `Rotate` would be more robust if it treated a missing directory as a no-op — there are no logs to rotate.

Fix: Return nil for `fs.ErrNotExist`.

```diff
 entries, err := os.ReadDir(dir)
 if err != nil {
+    if errors.Is(err, fs.ErrNotExist) {
+        return nil
+    }
     return fmt.Errorf("read log directory %s: %w", dir, err)
 }
```

2. **`os.Rename` fails across filesystems** — `pkg/instance/logs.go:54` (suggestion, gap) [Claude Opus 4.6, Gemini 3.1 Pro]

```go
if err := os.Rename(src, dst); err != nil {
    return count, fmt.Errorf("preserve %s: %w", entry.Name(), err)
}
```

`os.Rename` returns `EXDEV` when source and destination are on different filesystems. In practice, `LimaHome()` (under `~/.rd{suffix}/lima/`) and `LogDir()` (under `~/Library/Application Support/rancher-desktop-{instance}/log/`) are on the same filesystem. On Windows, both live under `%LOCALAPPDATA%`. A cross-device failure would require deliberate filesystem layout changes.

Fix: If this ever becomes an issue, add a copy-then-delete fallback. For now, a comment noting the assumption suffices.

3. **Empty directory left behind on partial `PreserveLogs` failure** — `pkg/instance/logs.go:45-58` (suggestion, gap) [Gemini 3.1 Pro]

```go
destDir, err := nextAvailableDir(logDir, instanceName)
// ...
for _, entry := range logFiles {
    if err := os.Rename(src, dst); err != nil {
        return count, fmt.Errorf("preserve %s: %w", entry.Name(), err)
    }
```

`nextAvailableDir` creates the destination directory before the rename loop. If the first `os.Rename` fails, the directory is left empty. Successive failures accumulate empty numbered directories (`opensuse`, `opensuse.2`, `opensuse.3`, ...).

Fix: If `count == 0` on error return, remove the empty directory.

```diff
+    if count == 0 {
+        os.Remove(destDir)
+    }
     return count, fmt.Errorf("preserve %s: %w", entry.Name(), err)
```

4. **Test files use different `os.IsNotExist` conventions** — `pkg/instance/logs_test.go:72,90` vs `pkg/util/logfile/logfile_test.go` (suggestion, regression) [Claude Opus 4.6]

`logs_test.go` uses the older `os.IsNotExist(err)` pattern while `logfile_test.go` (in the same PR) uses the modern `errors.Is(err, fs.ErrNotExist)`. Both work, but the inconsistency within one PR is avoidable.

Fix: Align `logs_test.go` to use `errors.Is(err, fs.ErrNotExist)` to match `logfile_test.go`.

---

### Design Observations

**Strengths**

- **Clean `Rotate` extraction.** Factoring rotation logic from `Create` into `Rotate` enables serial log rotation without creating a new file, while keeping `Create`'s behavior identical. The shared numbering and pruning code eliminates duplication. [Claude Opus 4.6, Codex GPT 5.4, Gemini 3.1 Pro]

- **Belt-and-suspenders preservation.** The controller's `preserveInstanceLogs` (API-driven deletion) and the service's `preserveAllInstanceLogs` (`svc delete`) cover complementary paths without interfering. If the controller already deleted an instance, `preserveAllInstanceLogs` finds nothing to move. If the controller was interrupted by `Stop()`, the service path catches remaining logs. [Claude Opus 4.6]

- **Atomic directory naming.** `nextAvailableDir` uses `os.Mkdir` with `fs.ErrExist` checks for race-free directory creation. Two concurrent callers cannot claim the same directory name. The 1000-entry limit prevents runaway accumulation. [Claude Opus 4.6]

- **Log filtering before directory creation.** `PreserveLogs` checks for `.log` files before calling `nextAvailableDir`, avoiding empty directories when no logs exist. [Claude Opus 4.6]

---

### Testing Assessment

Test coverage is good. New unit tests cover:

- `logfile.Rotate`: no-file (no-op), single rename, sequential numbering, pruning — 4 tests
- `instance.PreserveLogs`: moves log files, skips non-logs and directories, no-log-files (no-op), nonexistent source — 4 tests
- `instance.nextAvailableDir`: first call, collision numbering, nonexistent parent — 3 tests

Untested scenarios ranked by risk:

1. **Windows file-lock race in preservation** — No test covers `RDD_KEEP_LOGS=1` with a LimaVM deletion while processes still hold file locks. This is the path most likely to fail on Windows CI.
2. **`PreserveLogs` partial failure** — If a rename fails midway, the function returns a partial count and an error. No test verifies the destination directory state (some files moved, empty directory left behind).
3. **End-to-end `svc delete` with log preservation** — No integration test covers `rdd svc delete` preserving instance logs and removing `LIMA_HOME` afterward.

---

### Documentation Assessment

Documentation updates are thorough:

- `environment.md` consolidates `RDD_KEEP_LOGS` behavior into a dedicated "Log Preservation" section with sub-sections for rotation, preservation, and BATS defaults.
- `cmd_service.md` adds a description for the previously-empty `rdd service delete` section.
- `logfile.Rotate` and `instance.PreserveLogs` have complete godoc comments.
- Internal functions (`preserveInstanceLogs`, `preserveAllInstanceLogs`, `nextAvailableDir`) have clear doc comments.

One minor gap: the header comment in `scripts/collect-bats-logs.sh` still describes the script as collecting service and hostagent logs, but the script now also collects preserved instance-log directories. [Codex GPT 5.4]

---

### Commit Structure

The three commits represent a clean progression:

1. **2ce15bc** — Extract `Rotate`, add serial log rotation. Self-contained.
2. **df1d963** — Add instance log preservation during LimaVM deletion and `ShortDir` cleanup.
3. **480661b** — Extract `PreserveLogs` helper, add `svc delete` preservation. Refactors the inline implementation from commit 2 into a shared helper.

Commit messages accurately describe the changes. No fixup commits.

---

## Agent Performance Retro

### [Claude] Claude Opus 4.6

- **Unique contributions:** Identified the belt-and-suspenders design strength (controller + service paths complement each other without interfering). Noted the `os.IsNotExist` vs `errors.Is` style inconsistency across the PR's test files. Observed that `nextAvailableDir` uses `os.Mkdir` for atomic directory creation.
- **Accuracy:** High. Both suggestions are valid, correctly scoped, and appropriately rated. No false positives.
- **Depth:** Read all changed files, traced into callers. Did not flag the Windows file-lock race that Codex and Gemini identified — a notable miss given the repo-specific context about Windows process lifecycle issues.
- **Coverage:** Complete. All 10 files reviewed with explicit status.
- **Signal-to-noise:** Excellent. Two targeted suggestions, no noise. The design strengths section provided genuinely useful observations.

### [Codex] Codex GPT 5.4

- **Unique contributions:** Identified that preservation failures are swallowed before deletion proceeds, tracing both the controller path (line 77 → line 82) and the service path (line 424 → line 428). Flagged the `ShortDir` error handling gap in `Delete()`.
- **Accuracy:** Mostly accurate. Finding 1 correctly identifies the preserve-then-delete race. Finding 2 overstates the severity of the `ShortDir` error — the `_ = Stop()` pattern is pre-existing (commit f10522bf), and the new `_ = os.RemoveAll(instance.ShortDir())` follows the established convention. Codex's fix to propagate `Stop()` errors is out of scope for this PR.
- **Depth:** Traced the full deletion flow through both code paths. Used line-level references.
- **Coverage:** Complete. All 10 files reviewed.
- **Signal-to-noise:** Good. Finding 1 is the review's most valuable insight. Finding 2 mixes a pre-existing gap with the new change.

### [Gemini] Gemini 3.1 Pro

- **Unique contributions:** Flagged the empty directory left behind on partial `PreserveLogs` failure. Identified the Windows file-lock race from the `svc delete` perspective. Found the acknowledged limitation about stale PID files on Windows.
- **Accuracy:** Mixed. Finding 1 (Rotate on nonexistent directory) overstates severity as IMPORTANT — the stated scenario ("first boot of a fresh VM, `inst.Dir` may not exist yet") is incorrect because `store.Inspect` must succeed before `startInstance` is called, guaranteeing the directory exists. Finding 2 (Windows file locks) is valid and complements Codex's analysis.
- **Depth:** Good use of git blame on findings. Traced into `StopWithWait` and process lifecycle code.
- **Coverage:** Complete. All 10 files reviewed. Marked `limavm_lifecycle.go` as "Reviewed, no issues" despite the file containing the preservation gap that Codex flagged — a coverage miss.
- **Signal-to-noise:** Good. The empty-directory suggestion is a genuine improvement. The Rotate finding needed severity recalibration.

### Summary

| Metric | Claude Opus 4.6 | Codex GPT 5.4 | Gemini 3.1 Pro |
|---|---|---|---|
| Duration | 6:10 | 3:40 | 5:39 |
| Critical | 0 | 0 | 0 |
| Important | 0 | 2 | 2 |
| Suggestion | 2 | 0 | 2 |
| Design observations | 5 | 2 | 1 |
| False positives | 0 | 0 (1 overscoped) | 1 (severity inflation) |
| Unique insights | 2 | 2 | 2 |
| Files reviewed | 10 | 10 | 10 |
| Coverage misses | 1 (Windows race) | 0 | 1 (controller preserve gap) |

Codex delivered the most actionable finding (preserve-then-delete race) in the shortest time. Gemini independently confirmed the Windows angle and added the empty-directory cleanup suggestion. Claude provided the best design analysis and highest signal-to-noise ratio but missed the review's central finding. The multi-agent approach proved its value: no single agent identified all the important issues.

---

## Appendix: Original Reviews

### Claude Opus 4.6

Now I have everything needed for a comprehensive review. Let me compile the findings.

---

## PR #234: Preserve Lima serial logs across VM restarts

### Executive Summary

This PR adds three complementary log preservation mechanisms: (1) serial console log rotation before each VM start, (2) instance log preservation during LimaVM deletion when `RDD_KEEP_LOGS` is set, and (3) the same preservation during `rdd svc delete`. The implementation is clean, with a well-factored `Rotate` extraction and a shared `PreserveLogs` helper. The change also fixes a pre-existing gap where `ShortDir` was not cleaned up by `svc delete`. No critical issues found.

**Verdict: Approve.** The suggestions below are improvements, not blockers.

### Findings

#### Critical Issues

None.

#### Important Issues

None.

#### Suggestions

1. **`os.Rename` assumes same filesystem** — `pkg/instance/logs.go:54`

```go
if err := os.Rename(src, dst); err != nil {
    return count, fmt.Errorf("preserve %s: %w", entry.Name(), err)
}
```

`os.Rename` fails with `EXDEV` when `srcDir` (Lima instance under `~/.rd{suffix}/lima/`) and `LogDir()` are on different filesystems. In practice both are under `~`, so this won't happen on standard macOS, Linux, or Windows layouts. However, CI environments or unusual mount configurations could trigger it.

Given that this is a debugging feature (`RDD_KEEP_LOGS`) and the error is logged and handled gracefully by both callers, the practical risk is low. If it ever becomes an issue, the fix is a copy-then-delete fallback.

Fix: Add a comment noting the same-filesystem assumption, or implement a `moveFile` helper that falls back to copy+delete on `EXDEV`. (suggestion, gap)

2. **Test style inconsistency: `os.IsNotExist` vs `errors.Is`** — `pkg/instance/logs_test.go:72,90`

```go
assert.Assert(t, os.IsNotExist(err), "expected serial.log to be removed from source")
```

The `logfile_test.go` tests (same PR) use the modern `errors.Is(err, fs.ErrNotExist)` pattern, while `logs_test.go` uses the older `os.IsNotExist`. Both work, but the new test files within the same PR use different conventions. The production code in `service.go:441` also uses `os.IsNotExist`, so `logs_test.go` is technically consistent with production code.

Fix: Pick one convention within the PR's new test files. Since `logfile_test.go` already uses `errors.Is(err, fs.ErrNotExist)`, align `logs_test.go` to match. (suggestion, regression)

### Design Observations

**Strengths**

- **Clean `Rotate` extraction.** Factoring the rotation logic out of `Create` into `Rotate` is the right abstraction — it enables serial log rotation without creating a new file, while keeping `Create`'s behavior identical. The shared numbering and pruning code eliminates duplication.

- **Belt-and-suspenders preservation.** The controller's `preserveInstanceLogs` (for API-driven deletion) and the service's `preserveAllInstanceLogs` (for `svc delete`) cover complementary paths without interfering. If the controller already deleted an instance, `preserveAllInstanceLogs` correctly finds nothing to move. If the controller was interrupted by `Stop()`, the service path catches remaining logs.

- **`nextAvailableDir` uses `os.Mkdir` for atomic directory creation.** The create-or-increment loop with `fs.ErrExist` checks is race-free (two concurrent callers cannot claim the same directory name). The 1000-entry limit is a sensible upper bound.

- **`ShortDir` cleanup.** Adding `os.RemoveAll(instance.ShortDir())` to `Delete()` fixes a pre-existing gap where `~/.rd{suffix}/` survived `svc delete`. Good housekeeping bundled with the main feature.

- **Script update is complete.** The `collect-bats-logs.sh` changes correctly iterate preserved-log subdirectories before falling through to surviving Lima instance directories, with distinct output paths (`$dest/${vm_name}` vs `$dest/lima-${vm_name}`) to avoid collisions.

### Testing Assessment

Test coverage is good. New unit tests cover:

- `logfile.Rotate`: no-file (no-op), single rename, sequential numbering, pruning — 4 tests
- `instance.PreserveLogs`: moves log files, skips non-logs and directories, no-log-files (no-op), nonexistent source — 4 tests (including the `assertFileContent` helper)
- `instance.nextAvailableDir`: first call, collision numbering, nonexistent parent — 3 tests

Untested scenarios (ranked by risk):

1. **`PreserveLogs` partial failure** — if a rename fails midway, the function returns with a partial count and an error. The destination directory is left with some files. No test verifies this state. Risk: low (the callers handle the error correctly).
2. **`Rotate` on nonexistent directory** — `Rotate` returns an error from `os.ReadDir`, which is the right behavior. `Create` calls `MkdirAll` first so this path only matters for direct `Rotate` callers (the serial log rotation in `startInstance`, where `inst.Dir` always exists). Risk: very low.

### Documentation Assessment

Documentation is thorough and well-structured:

- `environment.md` consolidates `RDD_KEEP_LOGS` behavior into a dedicated "Log Preservation" section with clear sub-sections for rotation, preservation, and BATS defaults.
- `cmd_service.md` adds a description for the previously-empty `rdd service delete` section.
- `logfile.Rotate` and `instance.PreserveLogs` have complete godoc.
- Internal functions (`preserveInstanceLogs`, `preserveAllInstanceLogs`, `nextAvailableDir`) have clear doc comments.

No documentation gaps identified.

### Commit Structure

The three commits represent a clean progression:

1. **2ce15bc** — Extract `Rotate`, add serial log rotation. Self-contained.
2. **df1d963** — Add instance log preservation during LimaVM deletion and `ShortDir` cleanup. The `ShortDir` fix is somewhat orthogonal to log preservation, but it's small and logically related (both are about cleaning up properly during deletion).
3. **480661b** — Extract `PreserveLogs` helper, add `svc delete` preservation. Refactors the inline implementation from commit 2 into a shared helper.

Commit messages accurately describe the changes and explain the motivation. No fixup commits.

### Acknowledged Limitations

None found in the changed code.

### Coverage Summary

| File | Status |
|------|--------|
| `.github/actions/spelling/expect/rd.txt` | **Trivial** — spelling dictionary entries for `serialp`, `serialv` |
| `docs/design/cmd_service.md` | **Reviewed, no issues** — adds `svc delete` description |
| `docs/design/environment.md` | **Reviewed, no issues** — consolidates `RDD_KEEP_LOGS` docs |
| `pkg/controllers/lima/limavm/controllers/limavm_lifecycle.go` | **Reviewed, no issues** — serial rotation and `preserveInstanceLogs` placement is correct |
| `pkg/instance/logs.go` | **Finding 1** — `os.Rename` same-filesystem assumption |
| `pkg/instance/logs_test.go` | **Finding 2** — `os.IsNotExist` vs `errors.Is` inconsistency |
| `pkg/service/cmd/service.go` | **Reviewed, no issues** — `Delete()` ordering is correct |
| `pkg/util/logfile/logfile.go` | **Reviewed, no issues** — `Rotate` extraction preserves `Create` semantics |
| `pkg/util/logfile/logfile_test.go` | **Reviewed, no issues** — comprehensive `Rotate` tests |
| `scripts/collect-bats-logs.sh` | **Reviewed, no issues** — correctly iterates preserved-log subdirectories |


### Codex GPT 5.4

**Verdict** Needs changes

### Executive Summary
This PR adds serial-log rotation, extracts shared rotation logic, and tries to preserve Lima instance logs when `RDD_KEEP_LOGS` is set during `LimaVM` deletion and `rdd svc delete`. The rotation changes look sound, but the new preservation flow is still best-effort while the next step deletes the source directory, and `svc delete` now drops `LIMA_HOME` errors on the floor, so the PR can still lose logs or report a successful delete when the VM data was not actually removed.

### Findings

#### Critical Issues
None.

#### Important Issues
1. **Preservation failures are logged and then the source logs are deleted anyway** — [pkg/controllers/lima/limavm/controllers/limavm_lifecycle.go:77](/Users/jan/git/rancher-desktop-daemon/pkg/controllers/lima/limavm/controllers/limavm_lifecycle.go#L77) (important, regression)

```go
count, err := instance.PreserveLogs(inst.Dir, inst.Name)
if err != nil {
    logger.Error(err, "Failed to preserve instance logs")
    return
}
...
err = limainstance.Delete(deleteCtx, existingInst, true)
```

`instance.PreserveLogs()` moves files with `os.Rename()` at [pkg/instance/logs.go:54](/Users/jan/git/rancher-desktop-daemon/pkg/instance/logs.go#L54). On Windows, renaming an in-use file fails, and the finalizer path makes that realistic: `handleDeletion()` force-kills the hostagent/driver at [pkg/controllers/lima/limavm/controllers/limavm_lifecycle.go:55](/Users/jan/git/rancher-desktop-daemon/pkg/controllers/lima/limavm/controllers/limavm_lifecycle.go#L55), but `stopInstanceForcibly()` at [pkg/controllers/lima/limavm/controllers/limavm_lifecycle.go:633](/Users/jan/git/rancher-desktop-daemon/pkg/controllers/lima/limavm/controllers/limavm_lifecycle.go#L633) does not wait for those processes to exit before `preserveInstanceLogs()` runs at line 77. If `PreserveLogs()` fails, `preserveInstanceLogs()` swallows the error at lines 711-714 and `limainstance.Delete()` still removes the instance directory at line 82. The same bug exists in `rdd svc delete`: `preserveAllInstanceLogs()` only warns on `PreserveLogs()` failure at [pkg/service/cmd/service.go:451](/Users/jan/git/rancher-desktop-daemon/pkg/service/cmd/service.go#L451)-[454](/Users/jan/git/rancher-desktop-daemon/pkg/service/cmd/service.go#L454), then `Delete()` removes `LIMA_HOME` at [pkg/service/cmd/service.go:428](/Users/jan/git/rancher-desktop-daemon/pkg/service/cmd/service.go#L428). In both flows, a transient preserve failure turns into permanent log loss.

Fix: make preservation part of the delete precondition. Return an error from the preservation helpers, retry if needed after a short wait on Windows, and abort deletion if the logs were not moved successfully.

```diff
-func preserveInstanceLogs(ctx context.Context, inst *limatype.Instance) {
+func preserveInstanceLogs(ctx context.Context, inst *limatype.Instance) error {
     if os.Getenv("RDD_KEEP_LOGS") == "" {
-        return
+        return nil
     }
-    count, err := instance.PreserveLogs(inst.Dir, inst.Name)
+    count, err := instance.PreserveLogs(inst.Dir, inst.Name)
     if err != nil {
-        logger.Error(err, "Failed to preserve instance logs")
-        return
+        return fmt.Errorf("preserve instance logs: %w", err)
     }
     if count > 0 { ... }
+    return nil
 }
...
- preserveInstanceLogs(ctx, existingInst)
+ if err := preserveInstanceLogs(ctx, existingInst); err != nil {
+     return ctrl.Result{}, err
+ }
```

2. **`svc delete` can report success even when the new `LIMA_HOME` deletion failed** — [pkg/service/cmd/service.go:423](/Users/jan/git/rancher-desktop-daemon/pkg/service/cmd/service.go#L423) (important, regression)

```go
_ = Stop()
preserveAllInstanceLogs()
if os.Getenv("RDD_KEEP_LOGS") == "" {
    _ = os.RemoveAll(instance.LogDir())
}
_ = os.RemoveAll(instance.ShortDir())
return os.RemoveAll(instance.Dir())
```

This PR adds `_ = os.RemoveAll(instance.ShortDir())` at line 428 so `svc delete` now tries to remove the Lima home, but it still ignores both the `Stop()` error at line 423 and the `RemoveAll(instance.ShortDir())` error at line 428. That means `rdd svc delete` can return `nil` from deleting [pkg/instance/instance.go:39](/Users/jan/git/rancher-desktop-daemon/pkg/instance/instance.go#L39) even if the Lima VM directory under [pkg/instance/instance.go:113](/Users/jan/git/rancher-desktop-daemon/pkg/instance/instance.go#L113) was left behind. On Windows this is especially likely when a VM process or log file is still open, because `os.RemoveAll` does not remove in-use files there. After this change, the command can now claim the control plane was deleted while leaving the actual VM state on disk.

Fix: stop treating the new `ShortDir()` removal as best-effort. Propagate `Stop()` failures, and aggregate the `ShortDir()` / `Dir()` removal errors so the caller sees that deletion was incomplete.

```diff
- _ = Stop()
+ if err := Stop(); err != nil {
+     return err
+ }
...
- _ = os.RemoveAll(instance.ShortDir())
- return os.RemoveAll(instance.Dir())
+ return kerrors.NewAggregate([]error{
+     os.RemoveAll(instance.ShortDir()),
+     os.RemoveAll(instance.Dir()),
+ })
```

#### Suggestions
None.

### Design Observations

**Strengths**
- Extracting `Rotate()` from `Create()` in [pkg/util/logfile/logfile.go](/Users/jan/git/rancher-desktop-daemon/pkg/util/logfile/logfile.go) is the right refactor: it keeps numbering/pruning behavior in one place instead of duplicating log-rotation rules.
- Rotating serial logs before launching the hostagent in [pkg/controllers/lima/limavm/controllers/limavm_lifecycle.go:407](/Users/jan/git/rancher-desktop-daemon/pkg/controllers/lima/limavm/controllers/limavm_lifecycle.go#L407) matches the real lifecycle of those files, since the VM driver owns them and overwrites them on start.

### Testing Assessment
1. No integration test covers the failure-sensitive path where `RDD_KEEP_LOGS=1` and a `LimaVM` is deleted while its log files are still being released; that is the path most likely to fail on Windows.
2. No end-to-end test covers `rdd svc delete` preserving instance logs and actually removing `LIMA_HOME` afterward.
3. No integration test verifies serial-log rotation across real VM restarts; the new coverage is unit-only in [pkg/util/logfile/logfile_test.go](/Users/jan/git/rancher-desktop-daemon/pkg/util/logfile/logfile_test.go).

### Documentation Assessment
The design docs were updated consistently for `RDD_KEEP_LOGS`. One small mismatch remains: the header comment in [scripts/collect-bats-logs.sh](/Users/jan/git/rancher-desktop-daemon/scripts/collect-bats-logs.sh) still says the script collects service and hostagent logs, but the script now also collects preserved instance-log directories.

### Coverage Summary
- `.github/actions/spelling/expect/rd.txt` — Trivial
- `docs/design/cmd_service.md` — Reviewed, no issues
- `docs/design/environment.md` — Reviewed, no issues
- `pkg/controllers/lima/limavm/controllers/limavm_lifecycle.go` — Finding 1
- `pkg/instance/logs.go` — Finding 1
- `pkg/instance/logs_test.go` — Reviewed, no issues
- `pkg/service/cmd/service.go` — Findings 1, 2
- `pkg/util/logfile/logfile.go` — Reviewed, no issues
- `pkg/util/logfile/logfile_test.go` — Reviewed, no issues
- `scripts/collect-bats-logs.sh` — Reviewed, no issues


### Gemini 3.1 Pro

### Executive Summary

This PR successfully introduces log preservation mechanisms to prevent critical debugging information from being lost across VM restarts and deletions, extracting a reusable `Rotate` function for serial logs and moving instance logs to a persistent directory when `RDD_KEEP_LOGS` is set. However, there are a few important edge cases on Windows where surviving VM processes hold locks that cause log preservation (and subsequent instance deletion) to fail, and a minor correctness gap in `Rotate` that logs noisy errors on fresh VM boots.

### Findings

#### Critical Issues

None.

#### Important Issues

1. **`logfile.Rotate` returns an error if the directory does not exist** — `pkg/util/logfile/logfile.go:64-67` (important, regression)

```go
	entries, err := os.ReadDir(dir)
	if err != nil {
		return fmt.Errorf("read log directory %s: %w", dir, err)
	}
```

The newly extracted `Rotate` function fails with an error wrapping `fs.ErrNotExist` if `dir` does not exist. When `Rotate` is called by `startInstance` for the serial logs on the first boot of a fresh VM, `inst.Dir` may not exist yet. This results in a noisy `Failed to rotate serial log` error logged to the daemon for every fresh VM creation.

Fix: Catch and ignore `fs.ErrNotExist` so that rotation gracefully becomes a no-op on an empty or nonexistent directory.

```go
	entries, err := os.ReadDir(dir)
	if err != nil {
		if errors.Is(err, fs.ErrNotExist) {
			return nil
		}
		return fmt.Errorf("read log directory %s: %w", dir, err)
	}
```

2. **`preserveAllInstanceLogs` silently fails on Windows if instances are still running** — `pkg/service/cmd/service.go:424-425` (important, gap)

```go
	_ = Stop()
	preserveAllInstanceLogs()
```

When `rdd svc delete` is invoked in CI, `Stop()` attempts to shut down the control plane. As documented in `StopWithWait`, this often falls back to `TerminateProcess` on Windows, killing the daemon but leaving the VM driver and hostagent running in their own process groups. Immediately after, `preserveAllInstanceLogs()` calls `instance.PreserveLogs` which attempts to `os.Rename` the log files. Because the orphaned VM processes are still running, they hold file locks on the logs. On Windows, `os.Rename` fails with `ERROR_SHARING_VIOLATION`. The logs are not preserved, defeating the PR's purpose for CI on Windows, and the subsequent `os.RemoveAll` calls will fail because the files remain locked.

Fix: `Delete()` should explicitly terminate surviving instance processes before attempting to preserve logs and delete directories. Iterate over `instance.LimaHome()`, read the PID files, and call `process.KillTree` on them (similar to `stopInstanceForcibly`) to ensure file locks are released.

#### Suggestions

1. **Log preservation fails across filesystems** — `pkg/instance/logs.go:54-56` (suggestion, gap)

```go
		if err := os.Rename(src, dst); err != nil {
			return count, fmt.Errorf("preserve %s: %w", entry.Name(), err)
		}
```

`os.Rename` fails with an "invalid cross-device link" (`EXDEV`) error if the source (`LimaHome`) and destination (`LogDir`) directories are on different filesystems or drives. If a user customizes their Windows storage by mounting `AppData` or `.rd2` on a secondary drive, log preservation will fail.

Fix: Fall back to `io.Copy` followed by `os.Remove(src)` if `os.Rename` fails with an `EXDEV` error. 

2. **`PreserveLogs` leaves an empty directory if log preservation fails** — `pkg/instance/logs.go:54-58` (suggestion, gap)

```go
		if err := os.Rename(src, dst); err != nil {
			return count, fmt.Errorf("preserve %s: %w", entry.Name(), err)
		}
```

`PreserveLogs` creates `destDir` via `nextAvailableDir` before iterating over `logFiles`. If the very first `os.Rename` operation fails (e.g., due to a Windows sharing violation as noted in Finding 2, or a permissions error), the function returns an error immediately, leaving behind a newly created, empty directory (e.g., `opensuse.2`). Successive failures will litter the log directory with empty incremented folders.

Fix: Track whether any files were successfully moved. If the function is returning an error and `count == 0`, attempt to `os.Remove(destDir)` before returning to clean up the empty directory.

### Design Observations

#### Strengths

- Extracting `Rotate` into a standalone helper that works on existing logs without creating new ones is an elegant, non-intrusive way to preserve serial logs written by external drivers (QEMU/VZ) without having to intercept or modify their configurations.

### Testing Assessment

1. **`TestRotateNoDir`**: The function `Rotate` should be tested to ensure it safely returns `nil` when the target directory does not yet exist.
2. **`TestPreserveLogs_PartialSuccess`**: If `os.Rename` fails midway through the loop, `PreserveLogs` should return the count of successfully moved files and properly clean up the destination directory if `count == 0`.
3. **`TestPreserveLogs_CrossDevice`**: `PreserveLogs` should successfully fall back to moving files when `os.Rename` returns a cross-device error, if the fallback is implemented.

### Documentation Assessment

The documentation updates effectively cover the new log preservation commands and lifecycle behaviors.

### Commit Structure

The commits are logically structured and self-contained, cleanly separating the CLI deletion preservation, VM lifecycle preservation, and serial log rotation.

### Acknowledged Limitations

- `pkg/controllers/lima/limavm/controllers/limavm_lifecycle.go:61` — Code acknowledges that on Windows, "StatusBroken instances may have stale PID files whose PIDs were recycled to unrelated processes. stopInstanceForcibly uses taskkill, which kills by PID without verifying the process identity... The proper fix is to validate process identity... or use Windows Job Objects to track child processes without relying on PID files." This PR does not make this materially worse since `preserveInstanceLogs` passively reads whatever files are left behind.

### Coverage Summary

- `.github/actions/spelling/expect/rd.txt` — Trivial
- `docs/design/cmd_service.md` — Reviewed, no issues
- `docs/design/environment.md` — Reviewed, no issues
- `pkg/controllers/lima/limavm/controllers/limavm_lifecycle.go` — Reviewed, no issues
- `pkg/instance/logs.go` — Findings 3 and 4
- `pkg/instance/logs_test.go` — Reviewed, no issues
- `pkg/service/cmd/service.go` — Finding 2
- `pkg/util/logfile/logfile.go` — Finding 1
- `pkg/util/logfile/logfile_test.go` — Reviewed, no issues
- `scripts/collect-bats-logs.sh` — Reviewed, no issues
