Deep Review: 20260325-122724-pr-234
| Date | 2026-03-25 12:27 |
| Repo | rancher-sandbox/rancher-desktop-daemon |
| Round | 1 |
| Author | @jandubois |
| PR | #234 — Preserve Lima serial logs across VM restarts |
| Branch | rotate-serial-logs |
| Commits | d86eced Preserve Lima serial logs across VM restarts 75d179e Preserve instance logs on LimaVM deletion when RDD_KEEP_LOGS is set 0bad155 Preserve instance logs in svc delete when RDD_KEEP_LOGS is set |
| Reviewers | Claude Opus 4.6 Codex GPT 5.4 Gemini 3.1 Pro |
| Verdict | Merge with fixes — premature pruning bug when active log is missing; minor error-handling gaps |
| Wall-clock time | 12 min 39 s |
Executive Summary¶
This PR extracts a Rotate() helper from logfile.Create(), rotates serial logs before each VM start, and preserves instance logs on deletion when RDD_KEEP_LOGS is set. The design is clean: rotation reuses the existing numbering and pruning logic, preservation is non-fatal, and the three commits decompose naturally. Two bugs need fixing: Rotate prunes a backup when the active log is absent, and Rotate silently swallows non-ErrNotExist stat errors.
Critical Issues¶
None.
Important Issues¶
pruneOldFiles always receives nextN (maxN + 1) as its currentN argument, regardless of whether a rename occurred. When the active log is absent, no backup is created, yet the cutoff advances by one. With five existing backups (serial.1.log through serial.5.log), pruneOldFiles computes cutoff = 6 - 5 = 1 and deletes serial.1.log, reducing the pool to four files without adding a new one. Each call with a missing active log loses one backup.
numberedFiles = append(numberedFiles, numberedFile{n: n, name: entry.Name()})
if n > maxN {
maxN = n
}
}
nextN := maxN + 1
filePath := filepath.Join(dir, name+".log")
// Rename the current log to a numbered backup.
if _, err := os.Lstat(filePath); err == nil {
numberedName := fmt.Sprintf("%s.%d.log", name, nextN)
if err := os.Rename(filePath, filepath.Join(dir, numberedName)); err != nil {
return fmt.Errorf("rename %s to %s: %w", filePath, numberedName, err)
}
numberedFiles = append(numberedFiles, numberedFile{n: nextN, name: numberedName})
}
if !keepAll {
pruneOldFiles(dir, numberedFiles, nextN)
}
return nil
}
Fix: Pass maxN instead of nextN when the rename was skipped.
+ currentN := maxN
if _, err := os.Lstat(filePath); err == nil {
numberedName := fmt.Sprintf("%s.%d.log", name, nextN)
if err := os.Rename(filePath, filepath.Join(dir, numberedName)); err != nil {
return fmt.Errorf("rename %s to %s: %w", filePath, numberedName, err)
}
numberedFiles = append(numberedFiles, numberedFile{n: nextN, name: numberedName})
+ currentN = nextN
}
if !keepAll {
- pruneOldFiles(dir, numberedFiles, nextN)
+ pruneOldFiles(dir, numberedFiles, currentN)
}
If Lstat fails with something other than ErrNotExist (e.g., permission denied), Rotate skips the rename and returns nil. The caller proceeds to overwrite the log, losing its contents without warning. Non-existence errors should skip the rename; all others should propagate.
// Rename the current log to a numbered backup.
if _, err := os.Lstat(filePath); err == nil {
numberedName := fmt.Sprintf("%s.%d.log", name, nextN)
if err := os.Rename(filePath, filepath.Join(dir, numberedName)); err != nil {
return fmt.Errorf("rename %s to %s: %w", filePath, numberedName, err)
}
numberedFiles = append(numberedFiles, numberedFile{n: nextN, name: numberedName})
}
Fix:
if _, err := os.Lstat(filePath); err == nil {
// ... rename logic ...
+ } else if !errors.Is(err, fs.ErrNotExist) {
+ return fmt.Errorf("stat %s: %w", filePath, err)
}
Suggestions¶
If one file fails to rename (e.g., held open by a lingering process on Windows, or due to an EXDEV cross-device link error), the function returns immediately and skips all remaining files. For a best-effort debugging aid gated behind an environment variable, aggregating errors and preserving as many files as possible is more useful.
count := 0
for _, entry := range logFiles {
src := filepath.Join(srcDir, entry.Name())
dst := filepath.Join(destDir, entry.Name())
if err := os.Rename(src, dst); err != nil {
return count, fmt.Errorf("preserve %s: %w", entry.Name(), err)
}
count++
}
Fix: Collect errors and return errors.Join(errs...) at the end.
+ var errs []error
count := 0
for _, entry := range logFiles {
src := filepath.Join(srcDir, entry.Name())
dst := filepath.Join(destDir, entry.Name())
if err := os.Rename(src, dst); err != nil {
- return count, fmt.Errorf("preserve %s: %w", entry.Name(), err)
+ errs = append(errs, fmt.Errorf("preserve %s: %w", entry.Name(), err))
+ continue
}
count++
}
- return count, nil
+ return count, errors.Join(errs...)
Delete() removes ShortDir() (which contains LimaHome) with a raw os.RemoveAll. The normal LimaVM deletion path goes through limainstance.Delete, which lets the Lima driver run platform-specific cleanup (e.g., WSL2 distro unregister on Windows). Stop() at line 402 stops the service, but controllers may not complete full instance cleanup before returning. On macOS (VZ driver) this is harmless — VMs are just files. On Windows/WSL2, it could leave registered distros behind.
func Delete() error {
if !Exists() {
return fmt.Errorf("%q control plane does not exist", instance.Name())
}
_ = Stop()
preserveAllInstanceLogs()
if os.Getenv("RDD_KEEP_LOGS") == "" {
_ = os.RemoveAll(instance.LogDir())
}
_ = os.RemoveAll(instance.ShortDir())
return os.RemoveAll(instance.Dir())
}
This is future-facing: the project has no Windows support yet. Worth a comment documenting the limitation. If Windows support lands before this is revisited, add Lima-level instance deletion before the RemoveAll.
New code in logs.go:69 uses the modern errors.Is(err, fs.ErrExist), while service.go:420 uses the legacy os.IsNotExist. Both work identically for *os.PathError, but mixing styles within a PR is a minor inconsistency.
entries, err := os.ReadDir(instance.LimaHome())
if err != nil {
if !os.IsNotExist(err) {
logrus.WithError(err).Warn("Failed to read Lima instance directory")
}
return
}
Fix: Use errors.Is(err, fs.ErrNotExist) at line 420 to match logs.go. Or leave it — this matches the existing pattern in service.go.
Unlike Create (which calls os.MkdirAll), Rotate assumes the directory exists. Current callers satisfy this, but the exported API does not document the precondition.
// Rotate renames {name}.log to {name}.{N}.log without creating a new file.
// Use this for logs managed by an external process (e.g., VM serial console
// output written by the VM driver) that would be overwritten on next start.
//
// When keepAll is false, old numbered files beyond the retention count
// are removed.
func Rotate(dir, name string, keepAll bool) error {
pattern := regexp.MustCompile(`^` + regexp.QuoteMeta(name) + `\.(\d+)\.log$`)
entries, err := os.ReadDir(dir)
if err != nil {
return fmt.Errorf("read log directory %s: %w", dir, err)
}
Fix: Add "The directory must already exist" to the docstring.
// Rotate renames {name}.log to {name}.{N}.log without creating a new file.
+// The directory must already exist (unlike Create, which creates it).
// Use this for logs managed by an external process ...
Design Observations¶
Strengths
- Clean
Rotate/Createdecomposition. Extractingrotate()preserves all existing behavior (numbering, pruning, Lstat guard) while making rotation independently usable for serial logs.Createbecomes a thin wrapper. Claude Gemini Codex - Non-fatal log preservation. Both controller-side (
preserveInstanceLogs) and CLI-side (preserveAllInstanceLogs) log errors and continue rather than blocking deletion. For a debugging aid gated behind an environment variable, this is the right trade-off. Claude - Race-safe
nextAvailableDir. Usingos.Mkdir(notMkdirAll) as a test-and-create avoids races between concurrent callers without locks. TheErrExistcheck handles collisions correctly. Claude - Pre-filtering before directory creation.
PreserveLogsfilters for.logfiles before callingnextAvailableDir, avoiding empty directories when no logs exist. Claude - Correct ordering in
handleDeletion.preserveInstanceLogsruns afterStopForciblybut beforelimainstance.Delete, ensuring files exist and no process is writing to them. Claude
Concerns
- Cross-device rename in
PreserveLogs(future) —os.RenamereturnsEXDEVwhen source and destination live on different filesystems. Both paths are typically under$HOME, but users with non-standard mounts would fail consistently. A copy-and-delete fallback would make preservation robust across all storage layouts. Claude Gemini
Testing Assessment¶
Unit test coverage is solid:
logfile.Rotate: 4 test cases (no file, rename, sequential numbering, pruning)instance.PreserveLogs: 3 test cases (moves logs, skips non-logs, handles missing source)instance.nextAvailableDir: 3 test cases (first call, collision, missing parent)
Untested scenarios, ranked by risk:
- Rotate with existing backups but no active log — would catch Finding I1 (premature pruning). Set up numbered backups, omit the active log, verify no backup is deleted.
PreserveLogspartial rename failure — verify remaining files are still preserved (relevant after adopting the aggregation fix from Suggestion S1).- End-to-end serial log rotation through a VM restart — current tests exercise
Rotatein isolation; no integration test proves serial logs survive an actual restart.
Documentation Assessment¶
Documentation is accurate and well-organized. environment.md clearly explains the three effects of RDD_KEEP_LOGS. cmd_service.md now describes the directories svc delete removes. The spelling dictionary adds serialp and serialv.
Commit Structure¶
The three commits decompose cleanly: rotation primitive, controller-side preservation, CLI-side preservation. Each compiles and tests independently. Messages describe both what and why.
Agent Performance Retro¶
Claude Opus 4.6
Claude produced the most thorough review, covering all 12 files and identifying design strengths that other agents missed (race-safe nextAvailableDir, pre-filtering, correct ordering in handleDeletion). Its sole important finding — cross-device rename — is valid but low-risk (same filesystem in practice). The two suggestions (docstring precondition, os.IsNotExist inconsistency) are actionable. No false positives. Claude missed the premature pruning bug and the silent Lstat error that Gemini caught.
Codex GPT 5.4
Codex focused deeply on one finding: svc delete bypassing Lima's driver cleanup on WSL2. The analysis is technically sound but overstates the severity — the project has no Windows support yet, making this future-facing rather than a current regression. Codex missed all findings from both other agents (Rotate bugs, PreserveLogs abort, cross-device rename). Coverage was complete but shallow outside the one finding.
Gemini 3.1 Pro
Despite hitting a 429 rate limit during the review, Gemini produced the highest-signal findings: the premature pruning bug (I1) and the silent Lstat error (I2) are both real regressions that the other agents missed. The PreserveLogs abort-on-first-failure observation is also valid. The review is concise and well-structured. Gemini missed the cross-device rename concern and the svc delete/WSL2 issue.
| Metric | Claude Opus 4.6 | Codex GPT 5.4 | Gemini 3.1 Pro |
|---|---|---|---|
| Duration | 5:52 | 4:33 | 3:51 |
| Critical | 0 | 0 | 0 |
| Important | 1 | 1 | 3 |
| Suggestion | 2 | 0 | 0 |
| Design observations | 6 | 1 | 2 |
| False positives | 0 | 0 | 0 |
| Unique insights | 2 (docstring, style) | 1 (WSL2 cleanup) | 2 (pruning bug, Lstat error) |
| Files reviewed | 12 | 12 | 12 |
| Coverage misses | 0 | 0 | 0 |
Gemini delivered the most impactful review despite rate-limiting issues. Its two unique findings are real regressions that need fixing. Claude provided the deepest architectural analysis. Codex's sole finding, while technically valid, addresses a future concern rather than a current bug.