Deep Review: 20260325-180207-pr-234
| Date | 2026-03-25 18:02 |
| Repo | rancher-sandbox/rancher-desktop-daemon |
| Round | 2 |
| Author | @jandubois |
| PR | #234 — Preserve Lima serial logs across VM restarts |
| Branch | rotate-serial-logs |
| Commits | 2ce15bc Preserve Lima serial logs across VM restartsdf1d963 Preserve instance logs on LimaVM deletion when RDD_KEEP_LOGS is set480661b Preserve instance logs in svc delete when RDD_KEEP_LOGS is set |
| Reviewers | Claude Opus 4.6, Codex GPT 5.4, Gemini 3.1 Pro |
| Verdict | Merge with fixes — Windows file-lock race undermines log preservation in CI |
| Wall-clock time | 14 min 40 s |
Consolidated Review ¶
Executive Summary ¶
This PR adds three log preservation mechanisms: serial log rotation before each VM start, instance log preservation during LimaVM deletion, and the same during rdd svc delete — all gated on RDD_KEEP_LOGS. The implementation extracts a clean Rotate helper from Create and introduces a shared PreserveLogs function with atomic directory naming. The main concern is that both preservation paths treat failures as best-effort warnings, then proceed to delete the source directories — on Windows, where killed processes retain file locks briefly, this race will cause the preservation to fail and the logs to be deleted, defeating the PR’s purpose for Windows CI.
Structure: 0 critical, 1 important, 4 suggestions.
Critical Issues ¶
None.
Important Issues ¶
Both preservation paths — preserveInstanceLogs in the controller and preserveAllInstanceLogs in the service — log failures as warnings, then allow the caller to delete the source directory. In the controller, preserveInstanceLogs returns void at line 714, and handleDeletion calls limainstance.Delete at line 82 regardless. In the service, preserveAllInstanceLogs continues to the next instance at line 454, and Delete() calls os.RemoveAll(instance.ShortDir()) at line 428.
// preserveInstanceLogs moves log files from the Lima instance directory to
// a subdirectory of the service log directory before the instance is deleted.
// This is a no-op unless RDD_KEEP_LOGS is set.
func preserveInstanceLogs(ctx context.Context, inst *limatype.Instance) {
if os.Getenv("RDD_KEEP_LOGS") == "" {
return
}
logger := log.FromContext(ctx)
count, err := instance.PreserveLogs(inst.Dir, inst.Name)
if err != nil {
logger.Error(err, "Failed to preserve instance logs")
return
}
if count > 0 {
logger.Info("Preserved instance logs", "instance", inst.Name, "count", count)
}
}
The handleDeletion flow calls preserveInstanceLogs at line 77, then unconditionally proceeds to limainstance.Delete at line 82:
if existingInst != nil {
// Only use PID-based force-stop for Running instances. Broken
// instances may have stale PID files pointing to recycled processes
// on Windows (Lima's ReadPIDFile treats any live PID as valid).
// Not tested: simulating stale PID files requires Windows-specific
// PID file manipulation that BATS cannot easily reproduce.
if existingInst.Status == limatype.StatusRunning {
stopInstanceForcibly(ctx, logger, existingInst)
} else if existingInst.VMType == limatype.WSL2 {
// A "stopped" WSL2 distro can retain kernel state that deadlocks
// wsl.exe --unregister. Terminate it without PID-based killing,
// since the PIDs may have been recycled on Windows.
terminateWSL2Distro(ctx, logger, existingInst.Name)
}
if runtime.GOOS == "windows" {
// Clear PIDs so Lima's Delete → StopForcibly does not kill
// unrelated processes if the PIDs were recycled. Windows recycles
// PIDs aggressively, and Lima's ReadPIDFile treats any live PID
// as valid. On Unix, PID recycling is rare (wraps around 32768+),
// so we let Lima's Delete clean up any surviving driver processes.
//
// This disables Lima's internal kill retry even if stopInstanceForcibly
// failed above. That is intentional: a failed kill means KillTree
// could not reach the process (access denied, already reaped), and
// the PID may already be recycled. Retrying with a stale PID is
// worse than letting Delete proceed without a kill.
existingInst.DriverPID = 0
existingInst.HostAgentPID = 0
}
preserveInstanceLogs(ctx, existingInst)
logger.Info("Deleting Lima instance", "instance", limaVM.Name)
// Use a timeout because Lima's WSL2 driver calls wsl.exe --unregister
// which can hang indefinitely if the WSL subsystem is degraded.
deleteCtx, deleteCancel := context.WithTimeout(ctx, time.Minute)
err = limainstance.Delete(deleteCtx, existingInst, true)
deleteCancel()
if err != nil {
logger.Error(err, "Failed to delete Lima instance")
return ctrl.Result{}, err
}
logger.Info("Deleted Lima instance", "instance", limaVM.Name)
In service.go, Delete() calls Stop(), then preserveAllInstanceLogs(), then removes the directories unconditionally:
func Delete() error {
if !Exists() {
return fmt.Errorf("%q control plane does not exist", instance.Name())
}
_ = Stop()
preserveAllInstanceLogs()
if os.Getenv("RDD_KEEP_LOGS") == "" {
_ = os.RemoveAll(instance.LogDir())
}
_ = os.RemoveAll(instance.ShortDir())
return os.RemoveAll(instance.Dir())
}
// preserveAllInstanceLogs moves .log files from each Lima instance directory
// to the service log directory before the instance directories are deleted.
// This is a no-op unless RDD_KEEP_LOGS is set.
func preserveAllInstanceLogs() {
if os.Getenv("RDD_KEEP_LOGS") == "" {
return
}
entries, err := os.ReadDir(instance.LimaHome())
if err != nil {
if !os.IsNotExist(err) {
logrus.WithError(err).Warn("Failed to read Lima instance directory")
}
return
}
for _, entry := range entries {
if !entry.IsDir() {
continue
}
instDir := filepath.Join(instance.LimaHome(), entry.Name())
count, err := instance.PreserveLogs(instDir, entry.Name())
if err != nil {
logrus.WithError(err).WithField("instance", entry.Name()).Warn("Failed to preserve instance logs")
continue
}
if count > 0 {
logrus.WithFields(logrus.Fields{"instance": entry.Name(), "count": count}).Info("Preserved instance logs")
}
}
}
On Windows, stopInstanceForcibly (line 55) and Stop() (line 423) terminate processes, but Windows does not release file locks until the process handle is fully closed. The subsequent os.Rename calls in PreserveLogs (logs.go:54) fail with ERROR_SHARING_VIOLATION, the error is logged, and the source logs are then deleted — the exact outcome this PR aims to prevent.
On Unix this race is unlikely because killed processes release file descriptors immediately, and os.Rename does not require exclusive access. The practical impact is limited to Windows CI, which is the PR author’s primary concern (issue #247).
Fix: In preserveInstanceLogs, return an error and let handleDeletion decide whether to retry or proceed. In Delete(), add a brief time.Sleep or poll for file accessibility after Stop() on Windows before calling preserveAllInstanceLogs. Alternatively, accept the best-effort design but document the Windows limitation with a TODO.
-func preserveInstanceLogs(ctx context.Context, inst *limatype.Instance) {
+func preserveInstanceLogs(ctx context.Context, inst *limatype.Instance) error {
if os.Getenv("RDD_KEEP_LOGS") == "" {
- return
+ return nil
}
logger := log.FromContext(ctx)
count, err := instance.PreserveLogs(inst.Dir, inst.Name)
if err != nil {
- logger.Error(err, "Failed to preserve instance logs")
- return
+ return fmt.Errorf("preserve instance logs: %w", err)
}
if count > 0 {
logger.Info("Preserved instance logs", "instance", inst.Name, "count", count)
}
+ return nil
}
Suggestions ¶
Rotate propagates os.ReadDir errors, including fs.ErrNotExist. In startInstance (line 410 of limavm_lifecycle.go), inst.Dir always exists because store.Inspect has already succeeded, so this does not trigger in practice. However, as a public API, Rotate would be more robust if it treated a missing directory as a no-op — there are no logs to rotate.
func Rotate(dir, name string, keepAll bool) error {
pattern := regexp.MustCompile(`^` + regexp.QuoteMeta(name) + `\.(\d+)\.log$`)
entries, err := os.ReadDir(dir)
if err != nil {
return fmt.Errorf("read log directory %s: %w", dir, err)
}
// Find the highest existing sequence number.
maxN := 0
var numberedFiles []numberedFile
for _, entry := range entries {
matches := pattern.FindStringSubmatch(entry.Name())
if matches == nil {
continue
Fix: Return nil for fs.ErrNotExist.
entries, err := os.ReadDir(dir)
if err != nil {
+ if errors.Is(err, fs.ErrNotExist) {
+ return nil
+ }
return fmt.Errorf("read log directory %s: %w", dir, err)
}
os.Rename returns EXDEV when source and destination are on different filesystems. In practice, LimaHome() (under ~/.rd{suffix}/lima/) and LogDir() (under ~/Library/Application Support/rancher-desktop-{instance}/log/) are on the same filesystem. On Windows, both live under %LOCALAPPDATA%. A cross-device failure would require deliberate filesystem layout changes.
count := 0
for _, entry := range logFiles {
src := filepath.Join(srcDir, entry.Name())
dst := filepath.Join(destDir, entry.Name())
if err := os.Rename(src, dst); err != nil {
return count, fmt.Errorf("preserve %s: %w", entry.Name(), err)
}
count++
}
Fix: If this ever becomes an issue, add a copy-then-delete fallback. For now, a comment noting the assumption suffices.
nextAvailableDir creates the destination directory before the rename loop. If the first os.Rename fails, the directory is left empty. Successive failures accumulate empty numbered directories (opensuse, opensuse.2, opensuse.3, ...).
logDir := LogDir()
if err := os.MkdirAll(logDir, 0o700); err != nil {
return 0, fmt.Errorf("create log directory %s: %w", logDir, err)
}
destDir, err := nextAvailableDir(logDir, instanceName)
if err != nil {
return 0, err
}
count := 0
for _, entry := range logFiles {
src := filepath.Join(srcDir, entry.Name())
dst := filepath.Join(destDir, entry.Name())
if err := os.Rename(src, dst); err != nil {
return count, fmt.Errorf("preserve %s: %w", entry.Name(), err)
}
count++
}
Fix: If count == 0 on error return, remove the empty directory.
+ if count == 0 {
+ os.Remove(destDir)
+ }
return count, fmt.Errorf("preserve %s: %w", entry.Name(), err)
logs_test.go uses the older os.IsNotExist(err) pattern while logfile_test.go (in the same PR) uses the modern errors.Is(err, fs.ErrNotExist). Both work, but the inconsistency within one PR is avoidable.
Fix: Align logs_test.go to use errors.Is(err, fs.ErrNotExist) to match logfile_test.go.
Design Observations ¶
Strengths
- Clean
Rotateextraction. Factoring rotation logic fromCreateintoRotateenables serial log rotation without creating a new file, while keepingCreate’s behavior identical. The shared numbering and pruning code eliminates duplication. ClaudeCodexGemini - Belt-and-suspenders preservation. The controller’s
preserveInstanceLogs(API-driven deletion) and the service’spreserveAllInstanceLogs(svc delete) cover complementary paths without interfering. If the controller already deleted an instance,preserveAllInstanceLogsfinds nothing to move. If the controller was interrupted byStop(), the service path catches remaining logs. Claude - Atomic directory naming.
nextAvailableDirusesos.Mkdirwithfs.ErrExistchecks for race-free directory creation. Two concurrent callers cannot claim the same directory name. The 1000-entry limit prevents runaway accumulation. Claude - Log filtering before directory creation.
PreserveLogschecks for.logfiles before callingnextAvailableDir, avoiding empty directories when no logs exist. Claude
Testing Assessment ¶
Test coverage is good. New unit tests cover:
logfile.Rotate: no-file (no-op), single rename, sequential numbering, pruning — 4 testsinstance.PreserveLogs: moves log files, skips non-logs and directories, no-log-files (no-op), nonexistent source — 4 testsinstance.nextAvailableDir: first call, collision numbering, nonexistent parent — 3 tests
Untested scenarios ranked by risk:
- Windows file-lock race in preservation — No test covers
RDD_KEEP_LOGS=1with a LimaVM deletion while processes still hold file locks. This is the path most likely to fail on Windows CI. PreserveLogspartial failure — If a rename fails midway, the function returns a partial count and an error. No test verifies the destination directory state (some files moved, empty directory left behind).- End-to-end
svc deletewith log preservation — No integration test coversrdd svc deletepreserving instance logs and removingLIMA_HOMEafterward.
Documentation Assessment ¶
Documentation updates are thorough:
environment.mdconsolidatesRDD_KEEP_LOGSbehavior into a dedicated “Log Preservation” section with sub-sections for rotation, preservation, and BATS defaults.cmd_service.mdadds a description for the previously-emptyrdd service deletesection.logfile.Rotateandinstance.PreserveLogshave complete godoc comments.- Internal functions (
preserveInstanceLogs,preserveAllInstanceLogs,nextAvailableDir) have clear doc comments.
One minor gap: the header comment in scripts/collect-bats-logs.sh still describes the script as collecting service and hostagent logs, but the script now also collects preserved instance-log directories. Codex
Commit Structure ¶
The three commits represent a clean progression:
2ce15bc— ExtractRotate, add serial log rotation. Self-contained.df1d963— Add instance log preservation during LimaVM deletion andShortDircleanup.480661b— ExtractPreserveLogshelper, addsvc deletepreservation. Refactors the inline implementation from commit 2 into a shared helper.
Commit messages accurately describe the changes. No fixup commits.
Agent Performance Retro ¶
Claude Opus 4.6 ¶
- Unique contributions: Identified the belt-and-suspenders design strength (controller + service paths complement each other without interfering). Noted the
os.IsNotExistvserrors.Isstyle inconsistency across the PR’s test files. Observed thatnextAvailableDirusesos.Mkdirfor atomic directory creation. - Accuracy: High. Both suggestions are valid, correctly scoped, and appropriately rated. No false positives.
- Depth: Read all changed files, traced into callers. Did not flag the Windows file-lock race that Codex and Gemini identified — a notable miss given the repo-specific context about Windows process lifecycle issues.
- Coverage: Complete. All 10 files reviewed with explicit status.
- Signal-to-noise: Excellent. Two targeted suggestions, no noise. The design strengths section provided genuinely useful observations.
Codex GPT 5.4 ¶
- Unique contributions: Identified that preservation failures are swallowed before deletion proceeds, tracing both the controller path (line 77 → line 82) and the service path (line 424 → line 428). Flagged the
ShortDirerror handling gap inDelete(). - Accuracy: Mostly accurate. Finding 1 correctly identifies the preserve-then-delete race. Finding 2 overstates the severity of the
ShortDirerror — the_ = Stop()pattern is pre-existing (commitf10522bf), and the new_ = os.RemoveAll(instance.ShortDir())follows the established convention. Codex’s fix to propagateStop()errors is out of scope for this PR. - Depth: Traced the full deletion flow through both code paths. Used line-level references.
- Coverage: Complete. All 10 files reviewed.
- Signal-to-noise: Good. Finding 1 is the review’s most valuable insight. Finding 2 mixes a pre-existing gap with the new change.
Gemini 3.1 Pro ¶
- Unique contributions: Flagged the empty directory left behind on partial
PreserveLogsfailure. Identified the Windows file-lock race from thesvc deleteperspective. Found the acknowledged limitation about stale PID files on Windows. - Accuracy: Mixed. Finding 1 (Rotate on nonexistent directory) overstates severity as IMPORTANT — the stated scenario (“first boot of a fresh VM,
inst.Dirmay not exist yet”) is incorrect becausestore.Inspectmust succeed beforestartInstanceis called, guaranteeing the directory exists. Finding 2 (Windows file locks) is valid and complements Codex’s analysis. - Depth: Good use of git blame on findings. Traced into
StopWithWaitand process lifecycle code. - Coverage: Complete. All 10 files reviewed. Marked
limavm_lifecycle.goas “Reviewed, no issues” despite the file containing the preservation gap that Codex flagged — a coverage miss. - Signal-to-noise: Good. The empty-directory suggestion is a genuine improvement. The Rotate finding needed severity recalibration.
Summary ¶
| Metric | Claude Opus 4.6 | Codex GPT 5.4 | Gemini 3.1 Pro |
|---|---|---|---|
| Duration | 6:10 | 3:40 | 5:39 |
| Critical | 0 | 0 | 0 |
| Important | 0 | 2 | 2 |
| Suggestion | 2 | 0 | 2 |
| Design observations | 5 | 2 | 1 |
| False positives | 0 | 0 (1 overscoped) | 1 (severity inflation) |
| Unique insights | 2 | 2 | 2 |
| Files reviewed | 10 | 10 | 10 |
| Coverage misses | 1 (Windows race) | 0 | 1 (controller preserve gap) |
Codex delivered the most actionable finding (preserve-then-delete race) in the shortest time. Gemini independently confirmed the Windows angle and added the empty-directory cleanup suggestion. Claude provided the best design analysis and highest signal-to-noise ratio but missed the review’s central finding. The multi-agent approach proved its value: no single agent identified all the important issues.